Visual Tracking: IV. Interrelations of Target Speed and Aided-Tracking
Ratio in Defining Tracking Accuracy ¹


Betty E. Pearl, J. Richard Simon, and Karl U. Smith 


University of Wisconsin 


Many everyday forms of adjustment, as 
well as many specialized skills in military and 
industrial situations, depend upon accuracy 
in visual tracking. The present study deals
with a major problem in the design of con- 
trols for such tracking devices. This is the 
problem of the time constant for aided track- 
ing. In a broad sense, this study is concerned 
with the general problem of the instrumental 
relations of human motion. 

In the typical aided pursuit-tracking task, 
the operator causes a cursor or follower to 
“track” a moving target by appropriately ad- 
justing his handwheel control. Adjusting the 
control not only positions the cursor directly 
but also generates a rate in a motor which 
then drives the cursor independently of fur- 
ther control adjustments by the operator. 
The ratio between the change in velocity of 
the cursor and the amount of direct displace- 
ment of the cursor per unit of handwheel ro- 
tation is known as the aided-tracking time 
constant. It expresses the amount of time re- 
quired for the rate-control motor to drive the 
cursor to the target, regardless of the initial 
distance separating target and cursor.²
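
The arithmetic behind the time constant can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and the control is idealized as a single handwheel adjustment that is then held, so that the cursor jumps by the direct displacement and afterwards moves at the constant rate set by the motor. The figures used (a 10° displacement with time constants of 0.25, 0.5, and 1.0 sec.) are those of the footnoted example and of the present experiment.

```python
def aiding_rate(displacement_deg: float, time_constant_sec: float) -> float:
    """Cursor rate (deg/sec) generated together with a direct displacement.

    The time constant is taken as displacement divided by rate, so the
    rate is displacement divided by the time constant.
    """
    return displacement_deg / time_constant_sec


def cursor_position(displacement_deg: float,
                    time_constant_sec: float,
                    elapsed_sec: float) -> float:
    """Cursor position after one handwheel adjustment that is then held.

    Idealized assumption: an immediate jump by the direct displacement,
    followed by motion at the constant rate set by the motor.
    """
    rate = aiding_rate(displacement_deg, time_constant_sec)
    return displacement_deg + rate * elapsed_sec


if __name__ == "__main__":
    for tc in (0.25, 0.5, 1.0):
        print(tc, aiding_rate(10.0, tc), cursor_position(10.0, tc, 1.0))
```

With a fixed direct displacement, halving the time constant doubles the rate contributed by the motor, which is why a short time constant gives a "livelier" cursor for the same handwheel movement.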

The aided-tracking time constant has presented
a difficult problem to both engineers
and psychologists since the early days of
World War II. The surface problem is:
what aiding ratio should be built into tracking
devices in order to produce maximal accuracy
by the tracker? Although different
researchers have found various aiding ratios
to be optimum, there is a widespread belief
that there is an optimum time constant in the
neighborhood of 0.5 second.

¹ This research has been supported by funds voted
by the Legislature of the State of Wisconsin, and
assigned by the Graduate School Research Committee,
the University of Wisconsin.

² For example, a time constant of 0.5 sec. means
that a rotation of the handwheel which produces a
displacement of 10° in the position of the cursor
simultaneously generates a rate of cursor movement
of 20° per sec. A time constant of 0.25 sec. means
that a certain rotation of the handwheel produces
simultaneously a displacement of 10° in the position
of the cursor and a rate of cursor movement of 40°
per sec.

Aside from the applied problem of the 
choice of the time constant in aided tracking, 
there are broad theoretical questions of the 
psychological determination of tracking be- 
havior and of the instrumental relationships 
of such behavior. Two main assumptions 
have been widely accepted about the nature 
of aided tracking that are of importance to 
both the theoretical and applied problems 
named. One of these assumptions is that 
aided tracking, except under unusual circum- 
stances, is superior to direct tracking. Lin- 
coln and Smith (7) have shown that this as- 
sumption is incorrect as it applies to pursuit 
tracking. 

The other assumption underlying much cur- 
rent thinking about aided tracking is that the 
time constant is an invariant optimum over 
a wide range of conditions. Specific theo- 
retical notions have been proposed to explain 
such an invariant optimum. Mechler, Rus- 
sell, and Preston (9) proposed a reaction- 
time theory. They believed that the time 
constant corresponded roughly to the time of 
detection by the operator of an error align- 
ment between target and cursor and response 
to this error. A more general statement of 
this view has been given by Searle. The ex- 
pression “intermittency hypothesis” (10) has 
been used to identify the belief that tracking 
behavior represents a series of discrete per- 
ceptions and that the response time for these 
discrete reactions corresponds to the optimum
aided-tracking time constant. 

The primary purpose of this study is to 
evaluate the assumption that the aided-track- 
ing time constant is a relatively fixed value 
for a wide range of conditions of tracking. 
In conducting this experiment, the accuracy 
of tracking for different time constants has 
been measured at several speeds of target 
movement. 


Procedure 
Apparatus 


Figure 1 is a schematic drawing of the tracking 
device used in this study. The features of the ap- 
paratus have been described previously in detail (5). 
The apparatus consists of five separate units which 
together make up the linked tracking system. They 
are (a) a target-cursor display, (b) a handwheel 
control, (c) a universal tracking control, (d) a 
target path generating system, and (e) an error re- 
cording system. 

The critical elements of the tracking system for 
purposes of this study are the universal tracking con- 
trol and the target path generating system. The 
combination of rate and direct control which occurs 
in aided tracking is accomplished by means of a 
differential within the universal tracking control. A 
variable-speed motor permits variation in the rate of 
cursor movement per unit of handwheel rotation. It 
is by means of the speed control of this driving 
motor that variations in the aided-tracking time 
constant are introduced. 

The target generating system drives the target
through a complicated path which involves nine reversals
of direction and continuous variation in target
velocity. An over-all velocity change in the target
may be brought about by changing the setting
of a variable-speed motor. Since target velocity is 
continually changing, the target-speed variable in 
the present study is specified in terms of revolutions 
per minute of this variable-speed motor. 

The error-recording system (6) consists of a spe- 
cially designed Selsyn differential which continuously 
compares the position of the cursor with the position 
of the target. An integrating device automatically 
weights the tracking errors according to their mag- 
nitude and converts the error scores into accuracy 
scores which are registered on a precision clock. 


Experimental Design and Procedure 


The main data of this study consist of tracking 
accuracy scores for 27 subjects (Ss) over a four-day 
period. Two independent variables are manipulated. 
They are speed of target movement and aided-track- 
ing time constant. Three target speeds, specified in 
terms of revolutions per minute of a variable-speed 
motor, are used. The target speeds selected are 23 
r.p.m., 30 r.p.m., and 37 r.p.m. The second inde- 
pendent variable, time constant or aided-tracking 
ratio, is introduced through adjustment of the uni- 
versal tracking control. The time constants selected 
are 0.25 sec., 0.5 sec., and 1.0 sec. The nine combinations
of target speed and aided-tracking time
constant make up the nine experimental conditions
in the present study. A completely random 9 × 9
latin square is selected to which the nine experimen- 
tal conditions are assigned. 
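
The layout of the nine conditions in a latin square can be illustrated with a short sketch. It is not the randomization procedure actually used in this study, which is not described in detail: the sketch builds a cyclic square and shuffles its rows, columns, and condition labels, which yields a valid latin square but does not sample uniformly from all possible 9 × 9 squares. The function names and the random seed are hypothetical.

```python
import random
from itertools import product

TARGET_SPEEDS_RPM = (23, 30, 37)
TIME_CONSTANTS_SEC = (0.25, 0.5, 1.0)


def experimental_conditions():
    """All nine (target speed, time constant) combinations."""
    return list(product(TARGET_SPEEDS_RPM, TIME_CONSTANTS_SEC))


def random_latin_square(items, rng):
    """Return an n x n latin square of `items`.

    Built from a cyclic square with shuffled rows, columns, and symbols.
    Every item occurs once in each row and once in each column.
    """
    n = len(items)
    rows, cols, symbols = list(range(n)), list(range(n)), list(items)
    rng.shuffle(rows)
    rng.shuffle(cols)
    rng.shuffle(symbols)
    return [[symbols[(r + c) % n] for c in cols] for r in rows]


if __name__ == "__main__":
    rng = random.Random(1954)  # arbitrary seed, for a repeatable illustration
    square = random_latin_square(experimental_conditions(), rng)
    for sequence_number, sequence in enumerate(square, start=1):
        print(sequence_number, sequence)
```

Each row of the square is one sequence of conditions, and each column is one ordinal position in the sequence, so every condition appears once in every position.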

Twenty-seven right-handed students are used as
Ss. Three Ss are randomly assigned to each one of
the nine different sequences of conditions occurring in
the rows of the 9 × 9 latin square so that each S
performs on each experimental condition. The Ss
continue to perform in the sequence to which they
are originally assigned for the four days of the experiment.
The experimental design takes the form
of a 3 × 3 factorial in the cells of a replicated 9 × 9
latin square. It affords the simultaneous control of
individual differences and the order of presentation
of the experimental conditions.

Fig. 1. Controlled tracking device. The operator sits in the chair at the left
and adjusts a handwheel which is connected to a cursor by means of a shaft.
The operator’s task is to keep the cursor aligned with the moving target by
means of turning the handwheel control. The experimenter sits in the chair at
the right near the recording clock and is hidden from the operator’s view by a
screen while the trials are in progress.

On Day 1 the tracking task is explained to the 
subject. He is permitted to practice operating the 
handwheel control with the target remaining sta- 
tionary so that he may get the “feel” of the task 
before beginning the test trials. A buzzer is used as 
a ready signal. Each trial starts with the target and 
cursor aligned and the handwheel in the neutral po- 
sition. The apparatus switches off automatically at 
the end of a one-minute trial. The intertrial in- 
terval is of the order of 30 sec. 


Results 


In order to appraise the combined effects of 
target speed and aided-tracking time constant 
on the tracking accuracy of skilled operators, 
the data obtained on Day 4 have been ana- 
lyzed. This is the final day of training in 
the present experiment. Prior investigation 
(7) in the tracking situation used here has 
shown that learning of the tracking task is 
completed after about four days. 

Figure 2 pictures mean accuracy scores for 
the three target speeds and the three time 
constants on Day 4. The mean score for each 
target speed includes scores for all time con- 
stants and the mean score for each time con- 
stant includes scores for all target speeds. 
Greatest accuracy is obtained at the 23 r.p.m. 
or low target speed and the least accuracy is 
obtained at the 37 r.p.m. or high target speed. 
Similarly, the aided-tracking time constant of 
0.5 sec. produces highest over-all accuracy. 
The time constant of 0.25 sec. gives an intermediate
accuracy score and the 1.0-sec.
time constant produces the lowest accuracy.

Fig. 2. Mean accuracy scores for the three target
speeds and the three aided-tracking time constants.

Fig. 3. Interaction of target speed and aided-tracking
time constant. Differences in tracking accuracy
at the various time constants depend on the
speed of target movement. (Ordinate: accuracy in
seconds. Abscissa: target speed, 23, 30, and 37 r.p.m.
One curve for each time constant, 0.25, 0.50, and
1.0 sec.)

Figure 3 pictures the interaction effects of 
target speed and aided-tracking time constant 
on tracking accuracy. Accuracy is plotted as 
a function of target speed for each of the 
separate time constants. It may be noted 
that accuracy decreases as a function of tar- 
get speed regardless of the time constant 
used. But note also that the relative differ- 
ences between the curves for the three time 
constants change as a function of target 
speed. It is the failure of these differences to 
be alike which results in a significant interaction
between target speed and time constant.
This figure indicates that differences
observed between the time constants depend
on the particular target speed used. The
curve for the time constant of 1.0 sec. shows
a much sharper decrease with increasing tar- 
get speed than do the other two curves and 
the significant interaction between target 
speed and time constant seems to be due 
mainly to this effect. 

Table 1 summarizes the analysis of vari- 
ance of Day 4 scores. A Bartlett chi-square 
test indicates heterogeneity of variance in the 
error term (2). In order to compensate for 
this heterogeneity of variance, the degrees of 
freedom for the residual error are cut in half 
for evaluating the F ratio. This approximate 
procedure reduces the power of the F test so 
that it is no more powerful than it should be. 
In this analysis, the square uniqueness mean
square is significantly greater than the residual
error mean square. Therefore, square
uniqueness is used to test the significance of
the treatment mean square and a significant
F results. The treatment mean square is then
broken down into its components, target
speed, time constant, and target speed × time
constant interaction. Since the target speed
× time constant interaction is significant, it
becomes the error term to test the target
speed and time constant main effects. Both
speed and constant prove to be significant
over and above this interaction. This fact is
indicated in Table 1 by the significant F
values of 9.60 and 9.61.

Table 1

Summary of Analysis of Variance

Source                           df         SS         MS        F
Target speed                      2   1,002.87     501.43    9.60*
Time constant                     2   1,003.94     501.97    9.61*
Speed × constant interaction      4     208.88      52.22
Trials                            8      40.10       5.01
Sequences                         8   1,757.48     219.69
Residual between Ss              18   1,867.96     103.78
Square uniqueness                56     637.90      11.39
Residual error                  144     756.85       5.26
Total                           242   7,275.98

* Significant at the 5% level of confidence.
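
The entries of Table 1 can be checked arithmetically. The sketch below is only a bookkeeping aid with hypothetical names: it recomputes each mean square as SS divided by df, forms the two main-effect F ratios with the speed × constant interaction as the error term (as described above), and halves the residual-error degrees of freedom as the text describes. It does not reproduce the Bartlett test or look up critical values of F.

```python
# (df, SS) for each source of variation, copied from Table 1.
TABLE_1 = {
    "target speed": (2, 1002.87),
    "time constant": (2, 1003.94),
    "speed x constant": (4, 208.88),
    "trials": (8, 40.10),
    "sequences": (8, 1757.48),
    "residual between Ss": (18, 1867.96),
    "square uniqueness": (56, 637.90),
    "residual error": (144, 756.85),
}


def mean_square(source: str) -> float:
    """Mean square = sum of squares / degrees of freedom."""
    df, ss = TABLE_1[source]
    return ss / df


def f_ratio(effect: str, error_term: str) -> float:
    """F ratio of an effect tested against a chosen error term."""
    return mean_square(effect) / mean_square(error_term)


total_df = sum(df for df, _ in TABLE_1.values())
total_ss = sum(ss for _, ss in TABLE_1.values())

# Main effects are tested against the interaction mean square.
f_speed = f_ratio("target speed", "speed x constant")
f_constant = f_ratio("time constant", "speed x constant")

# Residual-error df cut in half to allow for heterogeneity of variance.
halved_error_df = TABLE_1["residual error"][0] // 2

if __name__ == "__main__":
    print(total_df, round(total_ss, 2))
    print(round(f_speed, 2), round(f_constant, 2), halved_error_df)
```

Because each main effect is tested against the interaction, its F ratio has only 2 and 4 degrees of freedom, which makes the test considerably more conservative than one based on the residual error.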

These significant F’s may arise from two
conditions: (a) a significant difference between
the within-treatment variances, and (b)
a significant difference between treatment
means. If one were justified in accepting the
hypothesis of homogeneity of variance, he
could conclude that the significant F’s for
target speed and time constant arose from
significant differences between the means.
However, Bartlett chi-square tests indicate
that the hypothesis of homogeneity must be
rejected, and we are left undecided as to the
true source of the significant variations observed.
There are no available statistical
techniques to enable an experimenter to determine
whether group means differ when
there is a significant F for groups and significant
heterogeneity of within-groups variance.

Fig. 4. Mean accuracy scores for each combination
of target speed and aided-tracking time constant.
Four significant gaps between the ranked treatment
means are indicated.

A bar graph of the mean accuracy scores of 
the nine combinations of target speed and 
aided-tracking time constant is shown in 
Fig. 4. The scores have been ranked from 
high to low. Since the analysis of variance 
shows that both target speed and time con- 
stant significantly affect tracking accuracy, it 
is desirable to know which of the nine separate
treatments differ significantly from each
other. 

A Duncan range test (1) was performed 
which tests the significance of difference be- 
tween ranked treatments in the analysis of 
variance. Essentially, the Duncan test de- 
termines the number of significant gaps be- 
tween ranked treatment means. Four gaps 
are found in the data, all of which are signifi- 
cant at the 5% level of confidence. 

It can be noted in Fig. 4 that the combina- 
tion of a target speed of 23 r.p.m. and a time 
constant of 0.5 sec. gives the highest ac- 
curacy score. This mean differs significantly 
from the fourth-ranked mean, the 30 r.p.m. 
target speed with a time constant of 0.25 sec., 
and all other means having a rank higher 
than 4. It does not, however, differ sig- 
nificantly from the second- or third-ranked 
means. The third-ranked mean for the target 
speed of 30 r.p.m., and time constant of 0.5 
sec., differs significantly from the sixth-ranked 
mean and all means which have a rank higher 
than 6, but it does not differ significantly 
from the means in the fourth and fifth rank. 

The combination of the target speed of 37 
r.p.m. with the 1.0-sec. time constant pro- 
duces the lowest accuracy score of the nine 
treatments. The Duncan test demonstrates 
that this condition is significantly different 
from all other conditions. The eighth-ranked 
mean, for the 30 r.p.m. target speed paired
with the 1.0-sec. time constant, is also found 
to differ significantly from all other condi- 
tions. These two conditions are the only 
ones that differ significantly from all other 
conditions. 


Discussion 


This study has been concerned with track- 
ing accuracy as it is affected by target speed 
and aided-tracking time constant. The prac- 
tical importance of the study does not lie in 
the effect of the independent variables per se, 
since these variables have long been recognized
as critical determinants of tracking accuracy
(5, 8, 9, 10). Rather, the importance
of the study lies in the fact that it permits 
an evaluation of the critical interaction be- 
tween the aided-tracking time constant and 
another characteristic of the tracking situa- 
tion. In this experiment, speed of target 
movement is the other main variable manipu- 
lated. 

A main problem in the design of aided-tracking
devices is what magnitude of aided-tracking
time constant should be built into
the system to provide the human operator
with maximum aid in his task of following a
moving target. Several studies have shown
that the optimum aided-tracking time con- 
stant is in the neighborhood of 0.5 sec. and 
the notion has become widespread that this 
figure of 0.5 sec. represents an invariant 
optimum which applies over a wide range of 
conditions. Following this notion and with- 
out further research, engineers have incorpo- 
rated this constant into the design of new 
aided-tracking systems. Thus far, one main 
type of theory has been advanced which at- 
tempts to explain this so-called optimum con- 
stant of 0.5 sec. in terms of intermittency of 
visual perception and the latency of human 
adjustive response (9). 

Results of the present study support past 
research in that the 0.5-sec. time constant 
remains optimum over the range of target 
speeds used. However, in no case does the 
0.5-sec. time constant produce significantly 
greater accuracy than the 0.25-sec. time constant
at the same target speed. The 1.0-sec.
time constant, on the other hand, is signifi- 
cantly inferior to the other two time con- 
stants at all three target speeds. 

Perhaps a more important finding is the
significant interaction between aided-track- 
ing ratio and target speed. This means that 
the relative effectiveness of a given time con- 
stant depends on the particular target speed 
used. It at least suggests that other aided- 
tracking ratios may become optimum as 
characteristics of the tracking situation are 
changed. In the light of the present results, 
it would seem unwarranted to assume that an 
aided-tracking time constant which is opti- 
mum under one set of conditions would re- 
main optimum when such variables as com- 
plexity of target course, type of operator con- 
trol, and type of tracking are manipulated. 
Since time constant interacts with target 
speed, it would not be surprising to find that 
the optimum time constant depends on these 
other variables as well. Until further ana- 
lytical research is forthcoming, the particular 
time constant used in tracking devices should 
be experimentally determined in terms of the 
actual task confronting the operator rather 
than arbitrarily selected in terms of some 
supposedly universal generalizations about 
human behavior. 

The present results, together with the re- 
lated studies of Lincoln and Smith (7), re- 
open the question of the theoretical basis of 
tracking and of aided tracking. The reaction- 
time or intermittency theory does not fit the 
present data inasmuch as there are significant 
functional relations between the aided-track- 
ing time constant and target speed. The 
previous studies of Lincoln and Smith sug- 
gest that direct pursuit tracking which in- 
volves an aiding constant of infinite value is 
superior to aided pursuit tracking with a 0.5- 
sec. time constant. There are probably many 
values of the aided-tracking time constant 
which are either superior to or not signifi- 
cantly different from the time constant of 
0.5 sec. 

If the intermittency hypothesis is incor- 
rect, as we believe it to be, what alternative 
notions may be proposed? We suggest an 
idea which may be considered as a motion 
resonance hypothesis of tracking. Tracking 
behavior is composed of two basic component 
movements, a rapid oscillatory positioning or 
manipulative movement, and a slower rate- 
control or travel movement. The dynamics 
of these movements as a function of target
speed represent the primary determination of
tracking error inasmuch as the frequency of
these movements is fundamentally defined
by orbits of body motion traveling at a de- 
fined speed. Aided tracking, which involves 
alteration in the perceptual characteristics of 
the cursor, will affect mainly the frequency of 
error in the more rapid positioning movement. 
Specifically, the rate aid in tracking will act 
as a filter to eliminate a certain percentage 
of the faster positioning movements. This 
filtering process will vary as a function of tar- 
get speed since the frequency of positioning 
errors will also vary as a function of target 
speed. The filtering effect or aid will be 
optimum when it provides the least interfer- 
ence with the rate control movements and 
when positioning movements constitute the 
major error in the tracking task. These con- 
ditions will be found with slow target speeds 
or linear target courses, which are the circum- 
stances that in general have been found to 
produce superior tracking with aiding devices. 

The theory just presented also predicts that
an aiding device will be of considerable significance
in compensatory tracking because in
this form of tracking the travel component
or rate-control motions have been eliminated
in large part by zeroing the target. This
hypothetical view of tracking represents an 
aspect of a general theory of motion that has 
significance in accounting for many aspects of 
discrete movements (3, 4, 11). 


Summary 


The subject of this investigation was the 
effect of variation in target speed and aided- 
tracking time constant on accuracy of visual 
tracking. A specially designed aided-pursuit 
tracking device was used. The task for the 
subject was to keep a cursor or follower 
aligned with a moving target by adjusting a 
handwheel control. Variations in the over- 
all target speed were introduced by adjusting 
a variable-speed motor to speeds of 23, 30, 
and 37 r.p.m. Three aided-tracking time
constants of 0.25, 0.5, and 1.0 sec. were used.
Each of twenty-seven Ss performed on all 
nine combinations of target speed and time 
constant. Time scores which integrated time 
on target and magnitude of error provided the 
measure of tracking accuracy. 

The time constant of 0.5 sec. remained optimum
over the range of target speeds used in
this study although there was not a statistically
significant difference between the 0.25-
and 0.5-sec. time constant at any of the three
target speeds. The 1.0-sec. time constant was 
significantly inferior at all three target speeds. 
The main finding of the study was the sig- 
nificant interaction between aided-tracking 
time constant and target speed. In other 
words, the relative effectiveness of a given 
time constant depends on the particular tar- 
get speed used. 

A motion resonance theory of tracking was 
proposed to account for the main phenomena 
of aided tracking. 


Received August 27, 1954. 


Multiple-Dial Check Reading: Pointer Symmetry Compared
with Uniform Alignment


Sherman Ross, L. T. Katchmar, and Harold Bell 


University of Maryland 


Recent reports have indicated that in many 
situations aircraft instrument displays are fre- 
quently read with a view toward discriminat- 
ing deviations from a normal or desired po- 
sition (check reading) rather than determin- 
ing precise quantitative readings (1, 2, 3). 
With multiple-dial displays it has been found 
that a rectangular patterning of dials, which 
allows for uniform pointer alignment, in- 
creases the speed and accuracy of check read- 
ing (2). Senders (4) has reported that 
qualitative reading time is approximately a 
linear function of the number of dials on the 
panel. This linear relationship apparently 
holds regardless of pointer orientation. Within 
the positions investigated the smallest slope 
was obtained with the pointers aligned at the 
9 o’clock position. Along similar lines White 
and his co-workers (5) have shown that in a 
16-dial display check-reading time is rela- 
tively independent of pointer alignment when 
the pointers are uniformly aligned at the 3, 
6, 9, or 12 o'clock positions. For qualitative 
reading, uniform alignment at the 9 o’clock 
position provided for the best performance. 
These workers point out that the apparent 
superiority of the 9 o’clock position may have 
been artifactual. 

While uniform pointer alignment in multi- 
ple-dial displays has been found to facilitate 
reading time and accuracy, Johnsgard (3) re- 
ports that a symmetrical alignment of point- 
ers in the 3, 6, 9, and 12 o’clock positions 
“facilitate check-reading equally as well as do 
panels with uniform alignment” (3, p. 410). 
His results also indicate that pointer sym- 
metry may be superior to uniform alignment 
with additional practice. 

This report presents the results of two re- 
lated experiments concerned with testing the 
difference between uniform and symmetrical 
alignment, in addition to assessing the effects 
of practice. 


Methods 


Subjects. Eighteen male and 6 female students 
enrolled in an experimental psychology course at 
the University of Maryland served as Ss. All Ss had 
20/20 vision corrected or uncorrected. 

Apparatus. Sixteen 4 in. diameter dials were 
mounted in a 4 × 4 rectangular pattern on a 7 × 7
in. plywood block. Each dial contained a single 
fixed pointer 42 in. wide and %g in. long. Each of 
the 16 dials could be rotated 360°, so that any de- 
sired pointer configuration could be had with rela- 
tive ease. Three configurations of 16 pointers were 
used: (C1) all pointers uniformly aligned at 12 
o’clock, (C2) pointers opposing each other at 6 and 
12 o’clock, and (C3) all pointers uniformly aligned at 
6 o'clock. The configurations are shown in Fig. 1. 

The dial patterns were projected on a homogene- 
ous white screen by inserting the dial panel into an 
opaque projector which was fitted with an Alphax 
shutter set for 45 sec. exposure. The projected dials 
were 6 in. in diameter at a distance of approximately 
8 ft. from the Ss. The experimental room was dark- 
ened in order to maintain effective contrast. A suffi- 
cient amount of light was emitted by the projector 
to permit Ss to record responses. The time re- 
quired to complete the test arrangements permitted 
adequate adaptation to the level of illumination. 

Procedure. The 24 Ss were randomly divided into 
two groups, A and B, each containing 3 female Ss. 
Each group was tested separately. The Ss were 
seated in three rows of chairs in front of the projector
in such a manner that each S had a clear
view of the projected dials. Each S was provided
with a test booklet which contained 69 16-dial
panels for recording his responses. The Ss were instructed
in the following manner: “. . . each time a
presentation is made you are to pick out which, if
any, of the dials has a pointer pointing in a direction
different from all the other pointers. If you
see such a dial place an arrow in the appropriate
circle on your response sheet showing the location
and direction of the deviating pointer.”

Fig. 1. The configurations tested in the experiment.
The arrangement on the left is C1 (Uniform
Alignment at the 12 o’clock position). The arrangement
on the right is C2 (Pointer Symmetry), where
Row 1 and Row 3 are set at 6 o’clock and Rows 2
and 4 at 12 o’clock. C3 (not shown) is the same
as C1 except that the pointers are set at 6 o’clock.
The configurations do not show the single deviant
pointer, which varied from trial to trial.

Table 1

Means and Standard Deviations for Positional
and Directional Errors *

               Positional Errors     Directional Errors
Group A           C1       C2           C1       C2
   Mean         12.08    11.77         7.92     7.92
   SD            3.54     3.97         2.75     3.93
Group B           C2       C1           C2       C1
   Mean         10.31     7.85         6.83     4.67
   SD            3.61     3.66         3.05     2.43

* The mean configuration scores for each group are shown in
the order in which the configurations were observed.

On each presentation one of the dials had a pointer 
deviating 90° from the general reference direction. 

Two of the above-mentioned configurations were 
used in each testing session. The test session con- 
sisted of 32 presentations of one configuration, fol- 
lowed by 32 presentations of the other configura- 
tion. Thus each of the 16 dials in a particular 
configuration showed its pointer deviating twice, 
once to the right and once to the left. The location 
and direction of pointer deviation for each presenta- 
tion was randomized. 
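
The structure of one block of 32 presentations can be sketched as follows. The sketch only illustrates the constraint stated above (each of the 16 dials deviates twice, once to the left and once to the right, in random order); the dial numbering from 1 to 16 and the function name are hypothetical, and the original randomization method is not reported.

```python
import random


def presentation_schedule(n_dials=16, rng=None):
    """Randomly ordered list of (dial, direction) pairs for one configuration.

    Every dial appears exactly twice: once deviating left, once right.
    """
    rng = rng or random.Random()
    trials = [(dial, direction)
              for dial in range(1, n_dials + 1)
              for direction in ("left", "right")]
    rng.shuffle(trials)
    return trials


if __name__ == "__main__":
    for number, trial in enumerate(presentation_schedule(rng=random.Random(1)), 1):
        print(number, trial)
```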

This procedure provided two error scores. The 
first type of error was committed when S failed to 
identify the “different” dial. The second type of 
error was committed by S when he did not correctly 
identify the direction of the deviating pointer. The 
maximum error score for each type of error during 
a single session for a particular configuration was 32. 
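
The two error scores can be illustrated with a small scoring routine. This is a sketch under a stated assumption, not the authors' scoring rule: the report does not say how a response is scored when the wrong dial is marked, so the sketch scores location and direction independently and counts an omitted response as both kinds of error. All names are hypothetical.

```python
from typing import List, Optional, Tuple

Trial = Tuple[int, str]        # (dial number, "left" or "right")
Response = Optional[Trial]     # None when no dial was marked


def score_block(presented: List[Trial],
                responses: List[Response]) -> Tuple[int, int]:
    """Return (positional errors, directional errors) for one block.

    Assumption: location and direction are scored independently, and a
    missing response counts as an error of both kinds.
    """
    positional = directional = 0
    for (dial, direction), response in zip(presented, responses):
        if response is None or response[0] != dial:
            positional += 1
        if response is None or response[1] != direction:
            directional += 1
    return positional, directional


if __name__ == "__main__":
    shown = [(1, "left"), (2, "right"), (3, "left")]
    marked = [(1, "left"), (5, "right"), None]
    print(score_block(shown, marked))
```

Under this rule each score has a maximum of 32 for a block of 32 presentations, in agreement with the maximum stated above.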


Experiment I 


The purpose of the first experiment was to test 
the difference between configurations 1 and 2, and, 
in addition, to determine the transfer effects of one 
configuration on the other. 

This experiment consisted of a single session for 
each group. Group A observed configuration 1 first, 
followed by configuration 2, while Group B observed 
these configurations in reverse order. The means and 
standard deviations for each of the error scores are 
shown in Table 1. 


An over-all test of significance ¹ for positional and
directional errors, respectively, between the two configurations
used did not prove to be significant (t =
.70, .86), indicating that the two patterns were of
equal difficulty. When the order of observing the 
configurations is taken into consideration, a different 
picture presents itself. With respect to positional 
errors, the difference between the two configurations 
for Group A is small and not significant. The dif- 
ference between the two configurations for Group B, 
however, is significant in favor of configuration 1, 
which was uniform alignment (t = 2.23). The mean 
difference between the two groups on configuration 
1 is also significant (t= 2.87). These results also 
hold true for directional errors. 

The results of the statistical analysis for direc- 
tional errors are similar to the ones presented above. 
There is no mean difference between the two con- 
figurations for Group A. The difference between 
the two configurations for Group B is significant 
(t = 2.32) in favor of configuration 1, uniform
alignment. The mean difference between Groups A 
and B on configuration 1 is also significant (t = 3.06). 
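
The between-group comparisons on configuration 1 can be approximately reconstructed from the means and standard deviations in Table 1. The report states only that mean differences were tested by t ratios, so the formula below is an assumption: an independent-groups ratio with 12 Ss in each group and the tabled SDs entered directly. Under that assumption the results fall within rounding distance of the reported values of 2.87 and 3.06; the function name is hypothetical.

```python
from math import sqrt


def t_ratio_independent(mean_a: float, sd_a: float,
                        mean_b: float, sd_b: float,
                        n_per_group: int) -> float:
    """t ratio for two independent groups of equal size.

    Assumed form: difference between means divided by
    sqrt((sd_a**2 + sd_b**2) / n_per_group).
    """
    standard_error = sqrt((sd_a ** 2 + sd_b ** 2) / n_per_group)
    return (mean_a - mean_b) / standard_error


# Configuration 1, Group A versus Group B (values from Table 1).
t_positional = t_ratio_independent(12.08, 3.54, 7.85, 3.66, 12)
t_directional = t_ratio_independent(7.92, 2.75, 4.67, 2.43, 12)

if __name__ == "__main__":
    print(round(t_positional, 2), round(t_directional, 2))
```

The within-group comparisons (configuration 1 against configuration 2 for the same Ss) cannot be checked this way, since a correlated-measures t ratio requires the individual difference scores, which are not reported.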

These results seem to indicate that configuration 2 
(pointer symmetry) is essentially a more difficult 
task, at least initially. It will be noticed that the
transfer effect from C 1 to C 2 for Group A is prac- 
tically nil, while the transfer effect from C 2 to C 1 
for Group B is substantially greater. It was this in- 
terpretation which led us to the follow-up experi- 
ment described below. 


Experiment II 


This experiment was carried out six weeks after 
the first experiment. The same groups of Ss were 
used. The purpose was to verify the results of the 
first experiment with uniform alignment at 6 o'clock, 
in addition to determining the effects of further 
practice. 

Each group was given two testing sessions one 
week apart. In the first session Group A observed 
configuration 2 first, followed by configuration 3. 
In the second session they observed configuration 3 
first, followed by configuration 2. Group B ob- 
served the configurations in the reverse order. The 
means and standard deviations for positional and 
directional errors for both sessions are shown in 
Table 2. 

An over-all test of significance for positional 
errors between the two configurations in Session 1 
proved to be significant in favor of C 3 (uniform 
alignment) (t= 2.29). This finding, however, does 
not hold true for directional errors (t = 1.21).

With respect to transfer effects, the results of Ses- 
sion 1 substantiate the results found in the first ex- 
periment. The difference between C 2 and C 3 for 
Group A for both positional and directional errors, 
respectively, proved to be significant (t = 3.0, 3.75).
A significant difference is also found for positional 


1 All mean differences were tested by t ratios. The
.05 level of confidence was required for statistical
significance.
Table 2

Means and Standard Deviations for Positional and Directional Errors

                  Positional Errors              Directional Errors
              Session 1     Session 2        Session 1     Session 2

Group A       C2     C3     C3     C2        C2     C3     C3     C2
  Mean       5.08   2.33   3.69   4.00      8.25   4.50   6.50   6.50
  SD         1.75   1.84   2.89   1.35      3.37   2.69   3.91   3.48

Group B       C3     C2     C2     C3        C3     C2     C2     C3
  Mean       4.67   4.42   3.83   4.08      5.08   7.17   6.33   6.17
  SD         2.43   3.52   3.13   2.78      2.66   5.89   3.75   4.69


errors between the first presentation of C 2 to Group 
A and the first presentation of C 3 to Group B (t= 
2.46). None of the other differences in Session 1 
were significant. 

None of the differences between groups or be- 
tween configurations in Session 2 were found to be 
significant. This is true for both positional and di- 
rectional errors. It will be noticed that the differ- 
ences between all possible combinations in Session 2 
are extremely small, indicating that both configura- 
tions are equally effective. While the difference is 
not significant, it is, however, difficult to interpret 
the increase in errors made to configuration 3 dur- 
ing Session 2. A decrease might be expected on the 
basis of the additional practice. 


Discussion 


The results of these two experiments verify 
the conclusion reached by Johnsgard (3), 
namely, that pointer symmetry is as
effective as uniform alignment for purposes 
of check reading. This conclusion, however, 
is true only after an extended period of prac- 


tice on the two different configurations. Our 
results do not, however, support the evi- 
dence presented by Johnsgard indicating that 
pointer symmetry may actually be superior 
to uniform alignment. On the contrary, the 
differences found in Experiment I and Session 
1 of Experiment II favor uniform alignment. 
This lack of agreement may be due in part to 
the different exposure times used. Johnsgard 
used a 14-sec. exposure, while we used a ¥,- 
sec. exposure. Our notion is that any basic 
difference between the two configurations 
would be forced to show up under more de- 
manding conditions. The only differences 
which did exhibit themselves were a function 


of the order in which the configurations were 
observed. For example, in the first experi- 
ment it was shown that the transfer effects 
from pointer symmetry to uniform alignment 
were greater than for the reverse direction. 
The relationship was verified in Session 1 of 
Experiment II. On this basis it would ap- 
pear that at least initially the configuration 
showing pointer symmetry was a more diffi- 
cult task. The effects of practice between 
Experiment I and Session 1 of Experiment II 
are rather large, and the observed decrease in 
errors is significant for both positional and 
directional errors. 


Conclusions 


The following conclusions may be drawn 
from the results of the two experiments: 


1. Configurations employing uniform align- 
ment or symmetrical alignment of pointers are 
equally effective for check reading after an 
extended period of practice. 

2. During the early stages of practice con- 
figurations employing symmetrical alignment 
appear to be more difficult than configurations 
employing uniform alignment. 

3. Transfer effects from pointer symmetry 
to uniform alignment are greater than the 
transfer effects from uniform alignment to 
pointer symmetry. 

4. A convenient technique for group test- 
ing of check reading of projected dial faces, 
which are easily modified, has been described. 


Control panels for complex electronic equipment
often contain many switches. In many
applications the operator could work more 
efficiently if he could select the appropriate 
switch without looking at it. Coding the 
switches by spacing them in groups or by 
using switch handles of different shapes should 
help the operator to locate a particular switch 
without visual cues. Weitz (2) has shown 
that accuracy of performance is significantly 
affected by the shape of aircraft-type con- 
trol knobs when visual cues are restricted. 
We might expect similar effects in the case of 
electronic control panels. 

The major purpose of the experiments re- 
ported here was to select a set of differently 
shaped switch handles that could be identified 
easily. Jenkins (1) has reported a very simi- 
lar investigation of shapes for aircraft-type 
controls. He selected sets of 8 and 11 knobs 
from sets of 25 and 22 knobs. The selected 
knobs were almost never confused in his 
studies. We adapted some of his shapes for 
the lever-switch handles, although most of 
the handles we studied are our own designs. 
In addition to selecting a set of easily dis- 
criminable shapes, we investigated the rela- 
tion between two experimental procedures for 
measuring the confusability of the handles. 
We also compared the relative discriminability 
of two different handle sizes. 


Experiment I 


In the first experiment we measured the confusions 
among the 16 handles shown in Fig. 1. The handles 
were made of black Lucite and were % in. long, and 
either % in. square or 3/8 in. in diameter. Although
most of the handles were designed to be maximally 
discriminable from the others, we deliberately in- 
cluded some similar pairs of handles in order to 
check our measurement techniques. Specifically, we 
expected square and diamond, half moon and groove, 
and cross and triple groove to be confused more 
often than the other handles. 

Two methods, which we have called the learn 


1 This research was supported jointly by the Army, 
Navy, and Air Force under contract with the Mas- 
sachusetts Institute of Technology. 


method and the find method, were used to measure 
confusability. In the learn method, S learned to as- 
sociate a number from 1 through 16 with each of 
the handles in a typical paired-associates procedure. 
The numbers were assigned to the handles in a dif- 
ferent random order for each S. 

The presentation device for the learn method con- 
sisted of a circular disk mounted so that it rotated 
in the vertical plane. The handles were mounted
near the periphery, with their axes perpendicular to 
the face of the disk. By rotating the disk behind a 
shield, E could position any handle near an aperture 
so that S could reach through to feel it, but could 
not see it. No attempt was made to control the 
manner in which Ss felt the handles, although they 
typically used the thumb and the index and middle 
fingers of their favored hand. Trials were arranged 
in blocks of 16, each handle being presented once in 
random order in each block. On each trial S was 
allowed approximately two seconds to feel the handle 
and report its number. The S was required to re- 
spond with some number on each trial, after which 
E announced the correct number. The handles were 
said to be learned when S responded correctly on all 
trials in one block. 

For the find method, a complete set of handles 
was mounted in a linear array on each of the four 
long sides of a rectangular box 2 in. × 2 in. × 18 in.
Successive handles were one inch apart. The handles 
were arranged in a balanced random order in each 
of the four arrays. For each trial, E presented a 
“target” handle on the rotary device used in the 
learn method. The blindfolded S felt the target, and 
then tried to locate the handle with the same shape 
in one of the four arrays, which was placed directly 
to the left of the rotary device. The S was in- 
structed to begin at the right of the array and to 
feel each handle in turn until he either found the 
target shape or reached the end of the array. He 
could then skip around, if necessary, but he was re- 
quired to select a handle on each trial. Four blocks 
of 16 trials were given; each handle was presented 
once in each block, and once in each array. 
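The balancing constraint in the find method (each handle presented once in each block and once in each array) can be sketched as follows. This is only one way to satisfy the constraint, not the authors' actual trial schedule. The function name, the zero-based array labels, and the use of a random offset per handle are assumptions made for illustration.

```python
import random


def find_schedule(n_handles=16, n_arrays=4, seed=0):
    """Return one block of trials per array; each trial is (handle, array).

    Every handle appears once per block and is paired with every array
    exactly once across the blocks.
    """
    rng = random.Random(seed)
    handles = list(range(1, n_handles + 1))
    # Hypothetical device: a random starting array for each handle,
    # advanced by one array on each successive block.
    offsets = {h: rng.randrange(n_arrays) for h in handles}
    schedule = []
    for block in range(n_arrays):
        trials = [(h, (offsets[h] + block) % n_arrays) for h in handles]
        rng.shuffle(trials)  # random presentation order within the block
        schedule.append(trials)
    return schedule


for block_number, trials in enumerate(find_schedule(), start=1):
    print(block_number, trials[:4])
```

The sketch does not balance the number of trials per array within a block, since the text does not say whether that was controlled.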

The Ss were 20 U. S. Army enlisted men with 
AGCT scores in the range 80-100. All Ss were run 
in both the learn and the find methods. Ten Ss 
were tested with the learn method first; the others 
were tested with the find method first. 


Results 


We first examined the differences between 
the two groups of 10 Ss. The group which 
learned after they found required an average 
of 24.4 blocks to reach the criterion, while 
[Fig. 1. Panel labels recovered from the figure: triangle, triple groove, sphere.]


the group who learned first required an aver- 
age of 18.1 blocks. This difference is not 
statistically significant. Likewise, the aver- 
age number of learning errors was not signifi- 
cantly different for the two groups. (These 
statistics adequately describe the learning 
data since all individual learning curves were 
virtually linear.) In the find method, the 
average number of errors was significantly 
smaller for the group which learned first, pre- 
sumably because they were familiar with the 
shapes. 

Finally, we wanted to compare the pattern 
of confusions among handles in the two 
groups. The learning data for each S were 
tallied in a confusion matrix that showed the 
number of times S made each possible re- 
sponse to each of the stimuli. A pooled 
matrix was obtained by summing the corre- 
sponding cells in the individual matrices and 
dividing by the total number of responses 
made to each stimulus. Each cell entry is 


[Fig. 1, continued. Panel labels recovered from the figure: groove, single knob, double knob. Handle shapes used in these experiments.]


then an estimate of the conditional probability 
of a specific response, given a particular 
stimulus. 

Using the term “confusion” for such ma- 
trices in the learn method seems to imply that 
all errors made during the learning process 


are being called confusions. We would pre- 
fer to distinguish conceptually between two 
sources of error: random guessing and spe- 
cific perceptual confusions. We have no way 
to classify any particular error as a guess or 
a confusion, but we can find the specific con- 
fusions, on the average, from the relative 
sizes of entries in the confusion matrix. If 
all errors were random guesses, the off-di- 
agonal entries would all be equal, except for 
statistical fluctuations. Thus a matrix with 
uniform off-diagonal entries indicates a homo- 
geneous set of stimuli, while departures from 
uniformity indicate the presence of specific 
perceptual confusions. Since we had no precise
statistical test of deviations from uniformity,
we arbitrarily defined the largest 5%
of the entries to be “predominant confusions.”
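The pooling and the 5% rule described above can be sketched in a few lines. The sketch assumes that each S's data are a square table of counts (rows are stimuli, columns are responses) and that the 5% is taken over the off-diagonal entries only; with 16 handles there are 240 such entries, so 5% gives 12. The function names are hypothetical.

```python
def pooled_confusion(individual_counts):
    """Sum the Ss' count matrices cell by cell, then divide each row by its
    total, so that every entry estimates P(response | stimulus)."""
    k = len(individual_counts[0])
    summed = [[sum(m[i][j] for m in individual_counts) for j in range(k)]
              for i in range(k)]
    return [[cell / sum(row) for cell in row] for row in summed]


def predominant_confusions(pooled, fraction=0.05):
    """Return (stimulus, response) index pairs for the largest off-diagonal
    entries. Assumes the fraction applies to off-diagonal cells only."""
    k = len(pooled)
    off_diagonal = [(pooled[i][j], i, j)
                    for i in range(k) for j in range(k) if i != j]
    n_top = round(fraction * len(off_diagonal))
    off_diagonal.sort(key=lambda entry: entry[0], reverse=True)
    return [(i, j) for _, i, j in off_diagonal[:n_top]]


# Two made-up Ss and three stimuli, for illustration only.
s1 = [[8, 2, 0], [1, 9, 0], [0, 0, 10]]
s2 = [[6, 3, 1], [3, 7, 0], [0, 1, 9]]
pooled = pooled_confusion([s1, s2])
print(predominant_confusions(pooled, fraction=1 / 3))
```

If all errors were random guesses, the off-diagonal entries in each row of the pooled matrix would be roughly equal, and the list returned by `predominant_confusions` would reflect only sampling fluctuation.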

When the confusion matrices were tallied 
for the two groups in the learn method, six of 
the twelve predominant confusions in each 
matrix were the same, indicating a moderate 
degree of agreement. We also computed the 
correlations of the corresponding off-diagonal 
cells and of the diagonal cells in the two ma- 
trices. The correlations were .70 and .63, re- 
spectively. They indicate, in a general way, 
how well the matrices agree. In order to 
evaluate the extent of individual differences, 
we split each group into two subgroups and 
obtained correlations between the subgroups. 
These correlations are of about the same 
magnitude as the correlations between groups, 
which suggests that any differences between 
groups are due to differences between indi- 
viduals. 
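The agreement between two pooled matrices can be computed along the lines sketched below. The text does not name the correlation coefficient, so a product-moment (Pearson) correlation is assumed. The function names and the example matrices are hypothetical.

```python
import math


def pearson(x, y):
    """Product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def matrix_agreement(a, b):
    """Return (off-diagonal correlation, diagonal correlation) for two
    confusion matrices of the same size."""
    k = len(a)
    cells = [(i, j) for i in range(k) for j in range(k) if i != j]
    off = pearson([a[i][j] for i, j in cells], [b[i][j] for i, j in cells])
    diag = pearson([a[i][i] for i in range(k)], [b[i][i] for i in range(k)])
    return off, diag


# Made-up matrices, for illustration only.
group_1 = [[0.70, 0.20, 0.10], [0.30, 0.60, 0.10], [0.05, 0.15, 0.80]]
group_2 = [[0.65, 0.25, 0.10], [0.25, 0.60, 0.15], [0.10, 0.10, 0.80]]
print(matrix_agreement(group_1, group_2))
```

Treating the diagonal and off-diagonal cells separately keeps the large correct-response proportions from dominating the measure of agreement among confusions.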

The data from the find method were also 
tallied in pooled confusion matrices for the 
two groups. The five largest confusions were 


common; the other predominant confusions 
did not overlap. The correlations for the find 
confusion matrices were .92 for diagonals, .93 
for off-diagonals, indicating a high degree of 


agreement. Thus, although the group who 
learned first made fewer errors in the find 
method, their pattern of confusions was very 
similar to that of the other group. 

These results made it reasonable to pool all 
20 Ss. The pooled matrices for the two meth- 
ods are shown in Tables 1 and 2. The pre- 
dominant confusions are shown in boldface 
type. Eight of these are common to the two 
matrices, indicating moderately good agree- 
ment between the two methods. The correla- 
tions between these tables are .50 for diago- 
nals and .83 for off-diagonals. We can safely 
conclude that both methods will find the 
same important confusions, although the fine 
structure of the confusion matrices may differ 
somewhat. 

The square-diamond and cross-triple groove 
confusions predominate. These were two of 
the three pairs deliberately put in to show 
confusion. The third pair that we expected 
to be confused often was half moon and groove;
this pair also ranks among the top 12. Tables 
1 and 2 show that sphere, single knob, and 
double knob, all of which have round, specially
shaped heads, were frequently confused
with one another. The similarities in the 
other shapes that were often confused are 
equally obvious. 

The confusion between square and diamond 
is especially interesting since the two handles 
actually have the same shape but differ in 
orientation. Our Ss were not warned to pay 
attention to the orientation, and apparently 
had difficulty in learning to make this distinc- 
tion. In fact, many Ss did not learn to dis- 
tinguish between these two handles, but ap- 
peared to reach the criterion of one perfect 
block by chance. This can be seen by com- 
paring the square-diamond confusion made 
by the two groups of Ss in the find method. 
If S was not noting the orientation, then he 
would probably pick the first of the two that 
he reached. The “find first” group picked 
the first of the two on 75% of the presenta- 
tions of square or diamond, of which 40% 
were correct. The “learn first” group picked 
the first of the two on 60% of the presenta- 
tions, of which 38% were correct. Clearly the 
former group was rarely responding to ori- 
entation, and in the latter group some of the 
Ss were still not aware of the orientation cue. 


Experiment II 


In the second experiment we compared the 16 
handles used in the first experiment with a set of 
16 handles having the same shapes, but one-third 
smaller (1/4 in. in diameter, rather than 3/8 in.).
The shapes were identical except for some slight 
alterations in the base to facilitate mounting the 
handles. Only the learning method was used in this 
experiment. One group of seven Ss learned the 
small handles, and was then tested on the large 
handles. Another group of seven Ss learned the 
large handles and then transferred to the small size. 


The results of this experiment can be sum- 
marized very simply. The groups did not 
differ significantly in the number of learning 
errors, the number of blocks to reach cri- 
terion, or the number of errors on the trans- 
fer trials. An average of 2.4 errors was made 
on the first block of transfer trials, indicating 
that Ss can transfer very easily. Six of the 
predominant confusions overlap. The corre- 
lation of the off-diagonal entries in the con- 
fusion matrices of the two groups was .66; 
Table 3

Predominant Confusions in Pooled Confusion Matrix
(Based on 27 Ss)

Stimulus             Response

 3 diamond            2 square
 2 square             3 diamond
 9 triple groove     11 cross
11 cross              9 triple groove
14 sphere            15 single knob
 7 half moon          5 square tab
 7 half moon          8 groove
 8 groove             7 half moon
 4 ramp               1 triangle
15 single knob       14 sphere
16 double knob       14 sphere
 9 triple groove     10 dumbbell

* Proportion of presentations of the stimulus that resulted in
the specific confusion.



the correlation of the diagonal entries was .63. 
These correlations are about as large as those 
obtained between the two groups on the first 
experiment, and seem to be about as large as 
individual differences permit. 


Experiment III 


In the third experiment we tested two subsets of 
ten handles that were chosen from the 16 original 
large-sized handles. In order to increase the stability 
of the confusion pattern, we pooled the seven Ss in 
Experiment II who learned the large handles first 
with the 20 Ss from the first experiment to obtain 
a “learn” confusion matrix based on 27 Ss. This
matrix is very similar to Table 1 and will not be
presented here. The 12 most prevalent confusions
of the pooled matrix are listed in Table 3. We
selected two subsets of ten handles in such a way
that most of these confusions were avoided. The
sets are shown in Tables 4 and 5.

We tested each subset by the “learn” method.
Thirty USAF airmen served as Ss, 15 learning each
subset. In this experiment we also determined how
accurately the Ss could identify the handles visually
after they had learned them tactually. Thus, when
S reached criterion (one perfect block) on the
tactual trials, the apparatus was placed so he could see
the handles but not feel them. Learning trials were
continued with the visual identification until he
reached criterion.


We first inspected the learning data for the
two subsets. Neither the difference between
the average number of learning errors nor the
difference in the average number of blocks
required to reach criterion was significant.

The transfer data show that the visual
imagery of our Ss was good. Both the Ss who
learned Subset 1 and the Ss who learned Subset 2
were able to identify an average of 8.7
of the ten handles on the first block of trials
after the tactual learning, although they had
never seen the handles before. They reached
criterion in an average of 2.2 blocks for Subset 1
and 1.8 blocks for Subset 2, which suggests
that the shapes can easily be discriminated
visually as well as tactually.

The confusion matrices for the two subsets
are shown in Tables 4 and 5, and are more


Table 4 


Confusion Matrix for Subset 1 * 








Response 





Stimulus 2 6 7 9 


8.2 2.1 6.9 2.7 
61.6 3.4 5.5 5.5 
7.5 2.1 6.2 1.4 
7.5 3.4 13.0 2.7 
6.2 72.6 2.7 3.4 
4.8 ai S27 4.1 
2.7 2.7 6.2 
4.8 4.1 4.8 
2.7 2.7 3.4 
5.5 6.2 3.4 


13 15 


1.4 6.2 
2.7 4.1 
6.9 4.8 
9.6 4.8 
2.7 2.7 
4.1 6.2 
0.7 4.8 
8.2 3.4 
58.9 5.5 
41 69.9 





triangle 
square 

ramp 

square tab 
eye 

half moon 
triple groove 
bullet 
standard 
single knob 


5.5 
54.8 
6.9 
3.4 
8.9 
2.1 
3.4 
2.1 
3.4 


1.4 
0.0 
34 


Total 111.5 93.9 103.4 104.8 95.2 99.3 112.4 





* Fifteen Ss were given a total of 146 trials per handle. Entries in any row show the proportion of presentations of the given
stimulus that resulted in each response. Predominant confusions are shown by boldface type. 
Table 5


Confusion Matrix for Subset 2 * 








Response 





Stimulus 3 5 6 8 12 13 


triangle 7.3 . 2.9 0.7 2.2 ' 2.9 2.2 
diamond 59.9 ‘ 44 2.9 1.5 . 0.7 8.8 
ramp . 95 13.1 5.8 7.3 : 0.7 3.6 
square tab , 8.0 . 58.4 4.4 22 Y 2.9 4.4 
eye : 5.8 2 713 28 0.7 66 
groove s 6.6 13.9 5.8 49.6 , 2.2 4.4 
cross ; 2.9 ; 6.6 2.2 5.8 : 0.7 4.4 
bullet : 1.5 . 2.9 0.0 3.5 2.9 79.6 3.6 
standard 5; 2.9 ; 7.3 4.4 2.2 0.7 5.1 59.9 
sphere b 2.9 t 4.4 1.5 1.5 4.4 2.9 4.4 


107.3 78.0 116.1 76.7 





Total 99.2 113.8 984 102.3 999.8 





* Fifteen Ss were given a total of 137 trials per handle. Entries in any row show the proportion of presentations of the given
stimulus that resulted in each response. Predominant confusions are shown by boldface type.


homogeneous than the matrix for the com- 
plete set in that the off-diagonal entries are 
more similar, and the largest entries are less 
extreme. The average size of the entries in 
the subset matrices is greater than in the com- 
plete matrix because of the reduced number of 
response alternatives. In both subsets it ap- 
pears that if a smaller subset were wanted, it 
would be better to omit square tab and stand- 
ard than any other two handles. However, 
we must remember that there is always a 
largest entry. If we were to omit the square 
tab and standard, other confusions would 
become largest. Actually, each subset was 
learned quickly: an average of 9.7 blocks 
was required to learn Subset 1, an average 
of 9.1 blocks for Subset 2. 


Discussion 


These experiments strongly suggest that 
tactual coding of switch handles can aid an 
operator materially in many applications. 
Since the Ss quickly learned to identify these 
shapes tactually, it is clear that differences in 
shape are useful cues for distinguishing 
switches. The utility of shape cues is also 
indicated by the ease with which subjects 
transferred to handles of different sizes and 
by the apparent ease of changing from tactual
to visual identification.

Most of our handles are not radially sym- 
metric; that is to say, they can be mounted in 


different orientations. In practice it is often 
difficult to set the handle at the desired ori- 
entation. The prominent confusion of square 
and diamond suggests that Ss could easily 
learn to disregard the orientation cue, so that
different orientations should not be troublesome.
In fact, it suggests that orientation should not 
be used as part of a coding scheme. Two 
handles of the same shape but in different ori- 
entation should signify the same thing, not 
two different things. 

Experiment I indicates that the find and 
learn methods will discover the same impor- 
tant confusions, although the fine structure of 
the confusion matrices may differ somewhat. 
Neither method is completely satisfactory. 
The results of the find method, which has 
been used by other investigators, depend on 
S’s familiarity with the set of stimuli. The 
S makes very few errors when he knows the 
stimulus alternatives, so the find method does 
not clearly show differences in the degree of 
perceptual confusion. The learning method 
provides a distribution with a greater range, 
but random guesses cannot be distinguished 
from specific confusions. 


Summary 


Three experiments are reported concerning 
the tactual identification of 16 differently 


shaped lever-switch handles. In Experiment 
I, two methods for measuring confusability 
were compared: the find method, in which S
searched through a set of 16 handles to find 
a particular handle, and the learn method, in 
which S learned to associate a number with 
each of the 16 handles. The methods agreed 
moderately but not perfectly in specifying 
the predominant confusions and in measuring 
their extent. Significantly fewer errors were 
made in the find method by Ss who had first 
been tested on the learn method, indicating 
that familiarity with the stimuli is an impor- 
tant factor. 

Experiment II showed that a set of 16 small
handles (1/4-in. diameter) was learned as
quickly as a corresponding set of large handles
(3/8-in. diameter), and that Ss transferred
from one size to the other with few errors.

Two subsets of ten handles were selected 
in such a way that most of the predominant 
confusions were avoided. In Experiment III
these subsets were tested by the learn method 
and found to be more homogeneous than the 
present set of 16 handles. Each subset was 
learned quickly. When Ss first saw the han- 
dles they had previously learned tactually, 
they could identify the handles with very 
few errors. 


Received August 23, 1954. 
Speed and Accuracy of Reading Polar Coordinates on a
Horizontal Plotting Table ¹,²


Bert F. Green and Lois K. Anderson 


Lincoln Laboratory, Massachusetts Institute of Technology 


In some visual display systems, targets ap- 
pear on a circular horizontal display. A polar- 
coordinate grid, consisting of range rings and 
azimuth markers, must be provided in order 
that the ranges and azimuths can be read. 
The range of a target is its distance from the 
origin; the azimuth, or bearing, of a target is 
the angle from a fixed “due north” radius to 
the radius through the target. The angle is 
measured in a clockwise direction, and varies 
from 0° to 360°. To select a polar-coordi- 
nate grid for such a display, it is necessary 
to know the speed and accuracy with which 
target positions can be read from grids hav- 
ing different patterns of range and azimuth 
indications. It is particularly important to 
know how speed and accuracy are affected by 
the amount of detail on a grid. Although a 
bright, detailed grid should be helpful in read- 
ing coordinates, it may obscure or mask the 
targets. It is necessary to find a grid that is 
relatively effective for coordinate reading but 
that interferes as little as possible with the 
targets. 

The authors are not aware of any experi- 
mental literature that bears directly on the 
problem. Garner, Saltzman, and Saltzman 
(4) have studied the problem of identifying 
range rings on a polar-coordinate grid. They 
find that, when eight range rings are to be 
identified, the rings should be broken into two 
groups of four by some coding device such as 
solid and dashed lines. 

Several investigators (1, 8, 9) have studied
some of the properties of range and azimuth 
estimates, but they did not study the effect 
of grid design. The literature on dial read- 


1 The research in this document was supported
jointly by the Army, Navy, and Air Force under
contract with the Massachusetts Institute of Technology.

2 Mr. E. T. Klemmer, of the Human Factors Operations
Research Laboratories, Washington, D. C.,
assisted the authors materially in the early stages of
this experiment.


ing, summarized by Kappauf (6) and by 
Chapanis, Garner, and Morgan (3), contains 
some relevant data on methods of dividing a 
scale into intervals. Our use of 10-unit, 20- 
unit, and 50-unit intervals is based on re- 
sults of dial-reading studies. 

This study was undertaken to obtain data 
on five specific grid designs. Data were obtained
concerning the time to locate a target,
the time to read its range and azimuth, and 
the accuracy of these readings in terms of 
range and azimuth errors. We were inter- 
ested also in differences due to different ori- 
entations of the grid with respect to the ob- 
server’s place at the table. All targets were 
well above threshold on the display. 


Experimental Procedures 


Six different polar-coordinate grids were studied. 
Five of these were included in the main part of the 
experiment; the sixth served as a control to check 
initial differences in observer proficiency and to 
measure the improvement in this proficiency during 
the course of the experiment. The six grids are 
shown in Fig. 1. The outermost circle on each grid 
represents a range of 150 miles from the center. 
Just beyond the 150-mile range ring of each grid, 
the azimuth is indicated numerically every 10°. 
Range rings are not labeled. 

The displays used in the experiment were prepared 
on 35-mm. film. Each frame on the film corre- 
sponded to one experimental trial. The projected 
image of each frame consisted of a small circular 
target superimposed on one of the six grids. In 
preparing the film, we determined the position of 
the target by a random selection from a uniform 
distribution over the area of the grid. Since we 
were working with polar coordinates, we used a se- 
lection method in which all azimuths were equally 
likely to be selected, but in which the probability of 
selecting a particular range was proportional to the 
range. This method was necessary in order to ob- 
tain a uniform distribution of targets over the area 
of the grid, since the element of area in polar
coordinates is r dr dθ.
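The selection method described above can be sketched as follows. Because the probability density of the range is proportional to the range, its cumulative distribution is (r/R)², and a range with that distribution can be drawn as R times the square root of a uniform random number. This inverse-transform step is a standard device and is assumed here; the text does not say how the authors actually drew their random positions. The function name is hypothetical, and the 150-mile maximum range is taken from the description of the grids.

```python
import math
import random


def sample_target(max_range=150.0, rng=random):
    """Draw one target position, uniform over the area of a circular grid.

    Returns (range_in_miles, azimuth_in_degrees).
    """
    azimuth = rng.uniform(0.0, 360.0)                    # all azimuths equally likely
    target_range = max_range * math.sqrt(rng.random())   # density proportional to range
    return target_range, azimuth


rng = random.Random(1)
targets = [sample_target(rng=rng) for _ in range(5)]
for target_range, azimuth in targets:
    print(f"range {target_range:6.1f} mi, azimuth {azimuth:5.1f} deg")
```

Drawing the range from a uniform distribution instead would crowd the targets toward the center of the grid, since the inner rings enclose less area than the outer ones.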

The displays were projected from underneath onto 
a horizontal translucent screen that formed the cen- 
tral part of a circular horizontal plotting table. The 
screen consisted of tracing paper placed between a 
[Fig. 1. Grid VI is the control grid.]
piece of plate glass, 4 in. thick, and a piece of transparent
polystyrene, 349 in. thick. The table was 32
in. high and 60 in. in diameter. The grid on the 
projected display was 40 in. in diameter. The target 
was approximately 0.13 in. in diameter, which cor- 
responds to about one mile on the grid scale. The 
target size was constant for all ranges. 

Each experimental trial consisted of presenting one 
frame of the film strip. To start the trial, an elec- 
tric impulse was used to advance the projector one 
frame and simultaneously to start a Standard Elec- 
tric Timer. In order to measure the time required 
to locate and point to a target, we provided the ob- 
server with a microswitch attached to the timer by 
a long cord. As soon as the target was located, the 
observer (O) placed the arm of the microswitch on 
the target and pressed down to close the switch. 
This action stopped the timer. The duration indi- 
cated on the timer is called the pointing time. It 
includes the time required for O to find the target— 
the search time—and the time required to make the 
motor response of placing the microswitch on the 
target—the motor-response time. When O had 
closed the switch, he then read the azimuth of the 
target to the nearest degree, followed by the range 
to the nearest mile. On some trials, the experimenter 
(E) measured the total time taken to make the com- 
plete report, including the pointing time and the 
time to read the range and azimuth. This is called 
the report time. Others have called such times dis- 
junctive reaction times or visual-discrimination re- 
action times. 

Five laboratory assistants with 20/20 visual acuity 
were used as Os in the experiment. They were in- 
structed to locate the target as rapidly as possible, 
and to read the coordinates of the target as
accurately as possible. Reading speed was not
emphasized.

The design of the experiment is shown in Table 1. 
Each of the five Os was tested with each of the six 
grids. For each O the first and last sessions con- 
sisted of 50 trials with the control grid. Within the 
five intermediate or experimental sessions—the main 


Table 1 


Design of Grid Experiment * 








Observers 





Sessions A B C D E


. Control VIs0 VIxs Vio VIw VIx0 
° Experimental IIo TV Tyo II Tx5 Vaio 
° Experimental IITy40 Ves ITs10 TV Too 

b Experimental Vo Ilo IV25 Isio II To 
e Experimental Tos IITs10 V0 Ilo TViso 
. Experimental IVsi0 Iso IIIs Visco Tos 
. Control Vivo View Vix Visio VIs5 








* Roman numerals indicate the particular grid and the sub- 
scripts indicate the position of the observer at the horizontal 
table. 
body of the experiment—the order of presentation
of the grids was balanced. We also controlled O’s 
position at the table, relative to the grid. During a 
session, O stood at a fixed position at the table. The 
five different viewing positions that were used are 
indicated by the azimuth of the point on the grid 
nearest O; 30°, 90°, 140°, 225°, or 310°. Thus in 
the experimental sessions there were four variables, 
each with five categories: five Os, five grids, five po- 
sitions, and five sessions. A graeco-latin square de- 
sign was used. In this design each category of a 
variable appears once and only once with each cate- 
gory of each other variable. Each experimental ses- 
sion consisted of 100 experimental trials preceded by 
10 practice trials to familiarize the O with the grid. 
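The balancing property of a graeco-latin square can be illustrated with a short sketch. The construction below is a standard one for a prime number of categories and is not necessarily the arrangement used in this experiment; the integer labels 0 to 4 and the function names are assumptions made for illustration only.

```python
from itertools import combinations

N = 5  # five sessions, observers, grids, and positions

def graeco_latin_design(n=N):
    """Return one (session, observer, grid, position) tuple per cell.

    Classical construction for prime n: two orthogonal latin squares
    given by (i + j) mod n and (2i + j) mod n. Labels are arbitrary.
    """
    return [(i, j, (i + j) % n, (2 * i + j) % n)
            for i in range(n) for j in range(n)]

def is_balanced(design, n=N):
    """True if, for every pair of variables, each pair of categories
    occurs together exactly once across the n * n cells."""
    for a, b in combinations(range(4), 2):
        seen = {(cell[a], cell[b]) for cell in design}
        if len(seen) != n * n:
            return False
    return True

design = graeco_latin_design()
print(len(design), is_balanced(design))
```

Because there are 25 cells and 25 possible category pairs for any two variables, finding 25 distinct pairs is equivalent to each pair occurring once and only once.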


Results 
Accuracy 


In target-coordinate reading there are two 
types of errors. In the first place, Os must 
interpolate between successive range or azi- 
muth lines in order to estimate the target po- 
sition to the nearest degree and mile. This 
process sometimes results in errors that are 
small relative to the spacing of azimuth and 
range rings—of the order of one or two miles 
or degrees; such errors will be called “inter- 
polation errors.” In addition, Os occasionally 
make large, or “gross,” errors. These errors 
are of the order of the spacing of successive 
grid lines. Errors of 50 or 100 miles in range 
are typical gross errors. The two types of 
errors cannot be distinguished except in terms 
of some arbitrary criterion. In the present 
analysis we defined gross errors as all errors 
of more than half the interval between successive
grid lines. Thus the cutting points
for Grid I were ± 25 miles and ± 15°; for
Grids II, IV, V, and VI, ± 5 miles and ± 5°;
and for Grid III, ± 10 miles and ± 5°. The
incidence of gross errors is very small. In 
almost all cases the gross errors are beyond 
four standard deviations from the mean, and 
may thus be classed as extreme deviants. We 
shall present separate analyses of interpola- 
tion errors and of gross errors. 
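The criterion for separating gross errors from interpolation errors can be written out as a small sketch. The spacings for Grids I to V are those stated in the text; the spacing for Grid VI is inferred from its cutting points, and the function and variable names are hypothetical.

```python
# Spacing between successive indications: (range in miles, azimuth in degrees).
# Grid VI spacing is an inference from its cutting points of 5 miles and 5 degrees.
GRID_INTERVALS = {
    "I": (50, 30),
    "II": (10, 10),
    "III": (20, 10),
    "IV": (10, 10),
    "V": (10, 10),
    "VI": (10, 10),
}

def is_gross(grid, error, kind):
    """Return True if |error| is more than half the grid interval.

    kind is "range" (error in miles) or "azimuth" (error in degrees).
    """
    range_step, azimuth_step = GRID_INTERVALS[grid]
    step = range_step if kind == "range" else azimuth_step
    return abs(error) > step / 2

print(is_gross("I", 50, "range"), is_gross("I", 20, "range"))
```

An error exactly equal to the cutting point is treated here as an interpolation error, since the definition refers to errors of more than half the interval.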

Gross errors. The most important fact 
about gross errors is that they are infrequent: 
they amount to less than 2% of all readings. 
Another characteristic is that some Os make 
more gross errors than others. One of our 
five Os accounted for more than half the total 
number of gross errors recorded in the experi- 
ment, while another O made no gross errors 
on the five experimental grids. There are
no statistically significant differences among 
grids with respect to the number of gross 
errors. 

Gross range errors are usually either 10, 50, 
or 100 miles, with some gross errors of 20 
miles on Grid III. Gross azimuth errors are 
usually either 10°, 100°, or 200° on all grids. 
In some cases, these errors are equal to the 
distance between successive range or azimuth 
markers. It seems reasonable to assume that 
these errors resulted from a confusion among 
markers. In other cases, particularly in in- 
stances of azimuth errors of 100° and 200°, 
some other explanation seems to be required 
since the errors bear no relation to the quad- 
rant divisions of our grids. (Gross errors of 
90°, 180°, or 270° never occurred in this ex- 
periment.) Gross errors of the second type 
appear to be decimal-system errors and may 
be interpreted as errors related to verbal 
habits. About 37% of the gross errors can 
be classed as verbal-habit errors. 
Interpolation errors. The standard deviation
of the distribution of interpolation errors
is plotted in Fig. 2 as a function of the ordinal 
number of the grid. We numbered the grids 
in what we considered to be the order of com- 
plexity of the grid designs. 

The major differences in range and azi- 
muth accuracies can be attributed to the 
separation between successive range and azi- 
muth indications on the grids. Azimuth in- 
terpolations on Grids II, III, IV, and V, 
which have some sort of azimuth indication 
every 10°, are uniformly better than those 
for Grid I, which has azimuth indications 
every 30°. Range interpolations are worst on 
Grid I, which has range indications every 50 
miles; and they are best for Grids II, IV, and 
V, which have range indications every 10 
miles. Range interpolation errors for Grid 
III, which has range indications every 20 
miles, fall between these extremes. It appears
that the major differences in interpolation accuracy
are due to the size of the interval between
successive indications. The form of
the range and azimuth indications appears to
be of secondary importance. Small hatches,
partial range rings and azimuth radii, and
small crosses yield roughly comparable results.

Fig. 2. Standard deviations of interpolation errors for range and azimuth readings for each grid.

We note that Grid V leads to the smallest 
range and azimuth errors. Although the re- 
sults for Grid VI are not strictly comparable 
with the results for the experimental grids be- 
cause of the experimental design, it appears 
that Grids V and VI yield readings of about 
the same accuracy. Thus the form of the 
range and azimuth indications is of some im- 
portance, but the effect is relatively small. 

For each grid the standard deviation of the 
range errors is larger than that of the azi- 
muth errors. This result may simply reflect 
the difference in the size of the units of the 
two scales. The distance in miles represented 
by a one-degree azimuth interval is propor- 
tional to the range. For example, the dis- 
tance covered by one degree in azimuth at a 
range of 10 miles is 0.17 mile, while at a 
range of 100 miles the distance is 1.7 miles. 
On the average, for the 150-mile range used 
in these displays, this distance (measured on 
the range arc) is 1.75 miles. (This average 
is obtained by using a weight for each range 
proportional to the probability of the occur- 
rence of a target at that range.) Thus an 
error of one degree in azimuth represents a 
bigger discrepancy on the average than an 
error of one mile in range. If we compare a 
10-mile interval with a 10-degree interval, we 
find that, on the average, the 10-degree in- 
terval spans a larger linear extent and that a 
circular target thus spreads over a smaller 
proportion of the 10-degree azimuth interval 
than it does in the 10-mile range interval. 
This difference in the size of units is probably 
an important factor in the superior azimuth 
accuracy. 
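The arc lengths quoted above can be checked with a brief calculation. The text does not state the weighting used for the average; the sketch below assumes that targets are uniformly distributed over the display area, so that the probability of a target at range r is proportional to r. Under that assumption the average comes out close to the 1.75 miles reported. The function names are illustrative.

```python
import math

MAX_RANGE = 150.0  # miles, the range of the displays

def arc_miles_per_degree(range_miles):
    """Distance along the range arc covered by one degree of azimuth."""
    return range_miles * math.pi / 180.0

def mean_arc_miles_per_degree(max_range=MAX_RANGE, steps=10000):
    """Average arc length per degree, weighting each range by r.

    Assumption: uniform target density over the display area, which
    makes the weight for range r proportional to r.
    """
    dr = max_range / steps
    ranges = [(k + 0.5) * dr for k in range(steps)]
    weighted = sum(r * arc_miles_per_degree(r) for r in ranges)
    return weighted / sum(ranges)

print(arc_miles_per_degree(10), arc_miles_per_degree(100))
print(mean_arc_miles_per_degree())
```

With this weighting the mean target range is two-thirds of the maximum, or 100 miles, which is why the average agrees with the value for a range of 100 miles.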

In order to see if the differences among 
grids were statistically significant, we made 
an analysis of variance of the standard de- 
viations of interpolation errors; range errors 
and azimuth errors were analyzed separately. 
Since the sampling variability among stand- 
ard deviations is proportional to the average 
standard deviation, a logarithmic transformation
was used in order to render the sampling
variability approximately constant, a condi- 
tion required by the analysis. The analyses 
of variance are presented in Table 2. With 
a graeco-latin square design it is possible to 
assess the main effects of each variable, but 
it is not possible to determine any of the in- 
teractions. The major factor in both analy- 
ses is the difference among grids; this factor 
is highly significant. It is interesting to note 
that the differences among observers were 
significant with respect to azimuth errors but 
not with respect to range errors. Thus some 
Os are consistently more accurate on azimuth 
estimation, but there are no comparable dif- 
ferences in accuracy of range readings. In 
any case, the differences among Os were small. 
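The F ratios and relative variance components of Table 2 can be recomputed from the mean squares as a check. The sketch below assumes that each expected mean square has the form of the residual variance plus five times the component for that factor, and that only the factors listed contribute a component; the helper names are hypothetical.

```python
LEVELS = 5  # categories per variable in the 5 x 5 square

def f_ratio(ms_effect, ms_residual):
    """Ratio of an effect mean square to the residual mean square."""
    return ms_effect / ms_residual

def relative_components(ms_effects, ms_residual, levels=LEVELS):
    """Estimate variance components and express them as shares of the total.

    Assumes E[MS_effect] = sigma^2 + levels * sigma_effect^2, so that
    sigma_effect^2 is estimated by (MS_effect - MS_residual) / levels.
    """
    comps = {name: max((ms - ms_residual) / levels, 0.0)
             for name, ms in ms_effects.items()}
    comps["Residual"] = ms_residual
    total = sum(comps.values())
    return {name: value / total for name, value in comps.items()}

# Mean squares taken from Table 2 (log10 of standard deviations).
azimuth_shares = relative_components({"Observers": 0.0434, "Grids": 0.1014}, 0.0027)
range_shares = relative_components({"Grids": 0.1700}, 0.0065)

print(f_ratio(0.0434, 0.0027), f_ratio(0.1014, 0.0027), f_ratio(0.1700, 0.0065))
print(azimuth_shares, range_shares)
```

The recomputed shares agree with the tabled values to within rounding.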

We have noted that the linear extent of the 
interval between successive azimuth indica- 
tions is proportional to the range. It is rea- 
sonable to expect that the interpolation errors 
in azimuth readings will decrease as the range 
increases, since the proportion of the azimuth 
interval covered by the target dot decreases. 


Table 2

Analysis of Variance of Interpolation Errors (Log10 of
Standard Deviations of Error Distribution)

(a) Analysis of Azimuth Errors

                                       Expected       Relative
                   Mean      F         Mean           Variance
Source       df    Square    Ratio     Square         Component

Observers     4    0.0434    16.07*    σ² + 5σ_o²     0.27
Grids         4    0.1014    37.56*    σ² + 5σ_g²     0.64
Positions     4    0.0005    -
Order         4    0.0081    3.00                     -
Residual      8    0.0027              σ²             0.09
Total        24                                       1.00

(b) Analysis of Range Errors

                                       Expected       Relative
                   Mean      F         Mean           Variance
Source       df    Square    Ratio     Square         Component

Observers     4    0.0130    2.00                     -
Grids         4    0.1700    26.15*    σ² + 5σ_g²     0.83
Positions     4    0.0042    -                        -
Order         4    0.0060    -                        -
Residual      8    0.0065              σ²
Total        24

* Significant at 0.1% level.
Fig. 3. Standard deviations of azimuth interpolation errors for each grid for targets in certain range intervals.


To investigate this possibility we divided the 
range into six intervals of 25 miles each, and 
found the standard deviation of azimuth in- 
terpolation errors for each interval on each 
grid. The results of this analysis are shown 
in Fig. 3. Since the points at the first range
interval (1 to 25 miles) are based on very
few measurements (none for Grid V), this interval
should be discounted in considering
differences among grids. Clearly, azimuth
estimation is better at long ranges for all
grids. The improvement is more marked for
the less complex grids; accuracy improves
very slightly for Grid VI as the range increases.

Although azimuth errors are larger at short 
ranges, they do not represent less accurate 
locations of the targets. When we convert 
azimuth errors to corresponding linear-dis- 
tance errors, we find that the increase in ac- 
curacy at long ranges is not proportional to 
the increased distance covered by a one-de- 
gree interval at these ranges. 

To show this relationship, in Fig. 3 we 
have plotted the hypothetical results for a 
case in which the standard deviation of the 
linear-distance errors of azimuth readings is 
equal for all ranges. (We arbitrarily chose a 
standard deviation of one mile. To obtain 
the comparable curve for any other SD, it is 
necessary to multiply each ordinate by the 
ratio of the new SD to the old SD.) This 
hypothetical curve has a steeper decline than 
the experimental curves, if the first range in- 
terval is disregarded. Thus, in terms of dis- 
tance errors, the readings are more accurate 
at short ranges. In evaluating the impor- 
tance of azimuth errors, it is always neces- 
sary to specify whether the angular error or 
the corresponding linear-distance error is more 
important. 
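The hypothetical curve described above can be sketched numerically. If the standard deviation of the linear-distance error is the same at every range, the corresponding angular standard deviation is inversely proportional to range. The use of interval midpoints below is an assumption, since the text does not say which range value represented each interval, and the function name is illustrative.

```python
import math

def angular_sd_degrees(linear_sd_miles, range_miles):
    """Angular SD (degrees) equivalent to a fixed linear SD at a given range.

    Uses the arc-length relation: angle in radians = arc / radius.
    """
    return math.degrees(linear_sd_miles / range_miles)

# Midpoints of the six 25-mile intervals (an assumed choice of abscissa).
midpoints = [12.5 + 25 * k for k in range(6)]
curve = [angular_sd_degrees(1.0, r) for r in midpoints]

for r, sd in zip(midpoints, curve):
    print(f"{r:6.1f} miles  {sd:5.2f} deg")
```

Changing the linear SD from one mile to some other value multiplies every ordinate by the same ratio, as noted in the text.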

It should be remembered that the size of 
the target in our study was constant—it did 
not change with the range. In some visual 
displays, including some radar displays, the 
linear size of the target is related to its 
range. Somewhat different results concern- 
ing the relation of azimuth errors to range 
might be obtained in those circumstances. 

Resolution. Another way of interpreting 
the standard deviations of the interpolation 
errors is in terms of the resolution of the grid 
readings. In an operational situation we 
would not know the actual position of the 
target, but only the O’s report of this po- 
sition. We therefore need to know what con- 
fidence to place in his report. Since we know 
that the reading errors are approximately nor- 
mally distributed, we can use the tables of 
the normal distribution to relate the standard 
deviations to the proportion of reports in a 
certain interval about the target. For ex- 
ample, if we had a criterion of 95% accuracy 
and if the SD of the range reports is 1.0 mile,
we would say that the resolution of the range 
reports was 3.92 miles, since the range error 
will be between + 1.96 miles and — 1.96 
miles in 95% of the reports—that is, if we 
have a single report, the odds are 19 to 1 
that the target range is somewhere in an in- 
terval of 3.92 miles centered on the reported 
range. Thus the resolution is obtained by 
multiplying the SD by a factor that depends 
on the criterion of accuracy. The factor is 
3.29 for 90% accuracy, 3.92 for 95% ac- 
curacy, and 5.15 for 99% accuracy. The 
graphs of resolution are identical with the 
graphs of SD’s with a different scale on the 
ordinate. 
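The multiplying factors can be recovered from the normal distribution directly. The sketch below uses the standard-library normal distribution in place of printed tables and assumes, as the text does, that reading errors are approximately normal with zero mean; the function names are illustrative.

```python
from statistics import NormalDist

def resolution_factor(accuracy):
    """Width, in SD units, of the symmetric interval that contains the
    stated proportion of a normal distribution."""
    z = NormalDist().inv_cdf((1 + accuracy) / 2)
    return 2 * z

def resolution(sd, accuracy):
    """Resolution of the readings for a given SD and accuracy criterion."""
    return sd * resolution_factor(accuracy)

for criterion in (0.90, 0.95, 0.99):
    print(criterion, round(resolution_factor(criterion), 2))
```

For an SD of 1.0 mile and a 95% criterion, `resolution(1.0, 0.95)` gives the 3.92-mile interval used in the example above.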

Other variables related to interpolation 
errors. We have not examined the fine struc- 
ture of the interpolation errors. Bartlett, 
Reed, and Duvoisin (1) investigated the 
standard deviation of range errors as a func- 
tion of the target’s distance from a range 
marker. They find that targets very close to 
a marker or halfway between two markers 
are read most accurately, while targets about 
one-third of the distance from one marker to 
the next are read least accurately. Chapanis 
and Leyzorek (2) have studied range inter- 
polation as a function of the number assigned 
to the range interval, and found that inter- 
polation on a ten-point scale is better than 
intervals numbered in other ways. Leyzorek 
(7) has also studied range interpolation as a 
function of the size of the interval. Reese 
et al. (8) and Rogers et al. (9) have studied 
the fine structure of azimuth estimates. 

We investigated the possibility that some 
interpolation errors could be attributed to 
number preferences of the observers. We 
might expect, for example, that 2, 5, and 8 
were used too frequently as the last digit of 
a range or azimuth report, with a corresponding
lack of 1, 3, 4, 6, 7, and 9. For the readings
of Grid II we tabulated the frequency
with which each number was used as a last 
digit in the range report and in the azimuth 
report. These frequencies exhibit random 
variation, but we found no evidence for number
preference. This negative result is in accord
with the results of Bartlett, Reed, and
Duvoisin (1), who found no pronounced number
preferences for range readings.

We also checked the data to determine 
whether the last digit of the range report 
might be the same as the last digit of the azi- 
muth report more often than one would ex- 
pect by chance. The results of this analysis 
were also completely negative. 

In the analysis of interpolation errors, we 
found that the O’s position at the horizontal 
table relative to the grid was not an impor- 
tant factor in determining accuracy. We felt 
that this result might be due to averaging 
over all the readings of the observer. How- 
ever, an analysis indicated no relation be- 
tween the interpolation error for a target and 
its distance from the observer. 

It is somewhat surprising that targets on 
the far side of a five-foot table from the O 
should be read as accurately as targets near 
him. This would seem to imply that an O 
can view all parts of a display of this size 
equally well. What actually happened in the 
experiment was that the Os leaned over the 
table in order to see the targets on the far 
side. With this procedure, the maximum dis- 
tance of a target from the O’s eyes was about 
48 in., while the minimum distance was about 
30 in. 
Speed


Pointing time. As we noted above, the 
pointing time included both the search time 
and the motor response time. It seems rea- 
sonable to assume that the motor response 
time will not be affected by changes in grid 
design. Then, if grid design is related to 
pointing time, it must be related to search 
time. For this reason, the pointing time is 
an index of the extent to which the detail of 
the grid obscures the target. The average 
pointing time, plotted as a function of the 
relative amount of detail on the grid, is shown 
in Fig. 4. Grid V is clearly inferior to the 
other grids in this respect. The little crosses 
on this grid seem to be very much like tar- 
gets and thus tend to obscure the target. It 
should be noted that the grid lines and the 
target were equally bright in our experiment. 
If the target were brighter than the grid, or 
had a different color, the masking effect might 
be reduced materially. On the other hand, if 
the target is difficult to discriminate, due to 
the presence of extraneous visual stimuli, or 
“noise,” the masking effect might be increased. 

Report time. In preliminary experiments 
we found that report time was not related to 
accuracy, and did not vary significantly from 
grid to grid. This result was duplicated in 
this study. Report times did not vary significantly
from grid to grid or from O to O.
The average report time in the experiment 
was 8.3 sec. (In some preliminary experi- 
ments using abbreviated reporting procedures 
and emphasizing speed, we obtained average 
report times of about 4 sec.) 

Report time is a composite of pointing 
time and the time required to read the co- 
ordinates. Since report times were about the 
same for all grids, and since pointing times 
were longest for Grid V, it follows that co- 
ordinate-reading times were shortest for this 
grid. Apparently our Os, who were striving 
for accuracy, read coordinates faster from the 
more detailed grid. 


Learning Effects 


In order to evaluate the extent to which the 
Os improved in speed and accuracy of co- 
ordinate reading during the course of the ex- 
periment, we have compared the results for 
the pretest and the posttest on Grid VI. 

We compared the variances of interpolation 
errors in pretest and posttest by F ratios. 
When the five Os were pooled, we found that 
both azimuth and range readings improved 
significantly between the two tests. Examin- 
ing each O separately, we found that two im- 
proved significantly in both range and azi- 
muth accuracy, and one became less accurate 
in range readings; all other differences were 
insignificant. 

For all Os, the average pointing times de- 
creased from pretest to posttest, but the de- 
crease was not statistically significant for two 
of the Os. Report times were significantly 
shorter for only one O. 

In general, the learning effects were not 
striking. Some Os improved, others did not. 
Since our experimental design cancelled the 
effect of learning by balancing the order of 
presentation of grids, and since the order of 
presentation was not a significant factor in 
the analysis of variance, we feel sure that 
learning did not affect the results obtained 
in the study. 

It should be noted that we tried to mini- 
mize learning in this experiment. It was not 
our purpose to improve speed or accuracy. 
A properly designed training program might 
produce marked increases in speed and accuracy
of polar-coordinate reading.


Discussion 


The purpose of this study was to select the 
best of the five experimental grids. Grid I 
can be eliminated because the readings are 
too inaccurate. Grids II, III, and IV are 
quite similar in most respects, although Grid 
III gives poorer range accuracy. Since Grid 
II is not significantly worse than III and IV 
in the measurements made, and since it has 
less detail to clutter up the display, Grid II 
should be used in preference to Grids III or 
IV. Grid V yields the most accurate read- 
ings, and can be read most swiftly, but it has 
a predominant masking effect. We suspect 
that with proper training, the coordinate- 
reading times for Grid II could be improved 
so that it was about as good as Grid V, but 
this is a conjecture. We conclude that the 
choice of a grid depends on the requirements 
of the situation. If extreme accuracy is of 
overriding importance, Grid V should be
used. However, if it is important to reduce 
the masking effect of a grid, and if a slight 
loss in accuracy, or resolution, can be toler- 
ated, Grid II should be used. 

There are, of course, a great many grid de- 
signs that we have not considered in this 
study. It would be interesting to compare 
a series of grids with major azimuth indica- 
tions at 60° intervals. Some evidence of 
Gebbard, Barber, and Halsey (5) suggests 
that 60° intervals might improve azimuth ac- 
curacy. Although we have studied a small 
class of grids, we feel that many of our re- 
sults are applicable in general and can be 
used as guides by those who wish to design 
new polar-coordinate grids. 

Coordinate reading is time-consuming. 
When speed is not emphasized, readings are 
made in about eight seconds. In earlier stud- 
ies where speed was emphasized and abbrevi- 
ated reporting procedures were used, the time 
required to make a report was about four sec- 
onds. If an observer has many other things 
to do or many coordinate readings to make, 
then an automatic data take-off device would 
be advantageous. 
Summary


This study investigated the speed and ac- 
curacy of reading the range and azimuth of 
targets presented on six different polar-co- 
ordinate grids. The display was presented 
on a horizontal plotting table. The major re- 
sults are: 

1. Grid V yields greatest accuracy, but has 
a large interference, or masking, effect. Grid 
II, shown in Fig. 1, offers the best com- 
promise between accuracy of coordinate read- 
ing and sparseness of detail in the grid pat- 
tern. This grid makes extensive use of short 
reference marks, spaced along the major 
range and azimuth indication lines. These 
small reference lines appear to be a fair sub- 
stitute for additional complete range and azi- 
muth lines. 


2. In general, the accuracy of coordinate 
reading appears to depend more on the size 
of the interval between successive range and 
azimuth indications than on the form of these 
indications. Small hatches were nearly as 
effective as complete lines extending across 
the grid. 


3. The accuracy of reading the azimuths 
of targets increased as the range of the target 
increased for a target of a fixed size on the 
display. However, if angular error is con- 
verted to linear-distance error, the azimuth 
readings are more accurate for shorter ranges. 


4. Gross errors were infrequent: they 
amounted to less than 2% of all readings. 
In most cases, gross errors result from con- 
fusions among grid markings. However, 
gross azimuth errors of 100° and 200° seem 
to be related to verbal habits. 

5. Some grid designs tend to mask targets 
and thus to increase the time required to 
locate the target. 

6. The time required to report the target 
is approximately the same for all grids. This 
time is about eight seconds when speed is not
emphasized and about four seconds when 
speed is required. 
The Effect of Immediately Preceding Task Brightness on Visual
Performance ¹


S. D. S. Spragg and J. W. Wulfeck ²


University of Rochester 


The experiments to be reported here are 
part of a research program on human visual 
performance at low photopic brightnesses 
which was undertaken to determine some of 
the human factors important in aircraft in- 
strument lighting. 

Our previous research in this program (8, 
9) was concerned with the speed and accuracy 
of dial-reading performance as a function of 
the quantity and the quality of dial illumi- 
nation. These experiments employed task 
brightness ranging from 0.005 foot-lambert 
(which is barely above cone threshold) to 6.0 
foot-lamberts. The results indicated the existence
of a critical brightness level at approximately
0.02 foot-lambert, below which
dial-reading performance (as measured by
both time and accuracy scores) becomes significantly
poorer as brightness is decreased,
and above which performance does not improve
significantly with increases in brightness.

On the basis of these results it was recom- 
mended that if airplane-instrument dial illu- 
mination were adjusted so that dial brightness 
would be just greater than 0.02 foot-lambert, 
this would be sufficient for adequate dial- 
reading performance and would at the same 
time maintain dark adaptation (for extra- 
cockpit visual tasks) to a greater degree than 
if higher dial brightnesses were permitted. 

A later series of experiments from our labo- 
ratory by Rock (6) corroborated the exist- 
ence of a critical brightness level at this ap- 
proximate value (0.02 to 0.05 foot-lambert) 
for a wide range of visual perceptual tasks. 
Subsequent experiments by workers in other
laboratories (2, 3) have corroborated these
results.

1 The experiments reported here were conducted as
part of a program of research on human factors
related to aircraft instrument lighting, carried out
under a research contract between the University
of Rochester and Wright Air Development Center,
ARDC, USAF, to whom these data have been reported
in WADC Tech. Rep. 52-285.

2 J. W. W. is now at Tufts College, Medford, Mass.

In night contact flying conflicting visual 
demands are made on the airplane pilot. He 
must, for example, have sufficient illumina- 
tion for adequate performance of detailed 
(cone-vision) tasks within the cockpit, espe- 
cially instrument dial reading. On the other 
hand he must be well enough dark adapted 
so that he can adequately detect and dis- 
criminate visual objects of low brightness and 
low contrast outside the cockpit. At times 
the two kinds of visual task may be de- 
manded in rapid succession. 

The present study is directed at the prob- 
lem raised by this dual nature of a pilot’s 
visual task in night operations. Specifically 
it is an attempt to determine how visual per- 
formance at low photopic brightness levels is 
related to the brightness of an immediately 
preceding visual task. 


Method 


Apparatus. The apparatus used in this study was 
a modification of that employed in our previous 
dial-reading experiments (8, 9). The essential parts 
of the modified apparatus are shown in Fig. 1. The 
S sits inside a darkened booth, with his forehead 
against the headrest, A. The 11 × 14-in. bank of
dials, D, is 28 in. from S’s eyes and 15° below his 
horizontal line of regard. The dials are evenly illu- 
minated by two sources of diffused light mounted 
symmetrically at either side of S’s head. One source 
is shown in the figure, at C, with its light baffle, B, 
to protect S’s eyes from stray light. Centered above 
the dials on S’s horizontal line of regard is a glass- 
windowed aperture, E, through which S views the 
far task periscopically by means of two front surface
mirrors, F1 and F2, mounted appropriately in a
light-tight housing lined with heavy-nap black wool 
cloth which serves effectively to reduce internal re- 
flections of grazing incidence along the optical path. 
Thus the bank of Landolt rings, G, placed just be- 
low the second window of the housing, is visible to 
S in his horizontal line of regard over an optical 
path of 18 ft., which can be regarded as practically 
equivalent to monocular infinity. In the lamp housing,
H, mounted below the second window, is a
25-w. Mazda bulb and holders for neutral density
filters and metal diaphragms. The inside of the
housing is painted flat white and so designed that
it throws diffuse illumination upon a ground-glass
plate just below the window of the mirror compartment.
A narrow slot between the window and
the ground-glass plate permits the introduction of
glass slides, G, bearing the stimulus materials which
are thus viewed by transmitted light.

Fig. 1. Cutaway drawing of apparatus. Explanation of labels is given in the text.

Toggle switches, microswitches, and relays were 
so arranged that time to read the bank of dials was 
recorded on one clock; then as the last dial was 
read, E flipped a switch which extinguished the dial 
illumination, turned on the Landolt-ring illumina- 
tion, and started the time clock which recorded time 
to read the rings. Error and time scores were re- 
corded on previously prepared data sheets. 

Stimulus materials for the dial-reading task were 
banks of 12 high-contrast photographic reproduc- 
tions of instrument dials which had been used in our 
previous studies (see 8, p. 130). The dials were 2.8 
in. in diameter and the scale was 100 × 10. The far
task consisted of banks of nine Landolt rings with
the breaks in the rings placed randomly at the four
cardinal compass points. The rings, which were 1.5 
in. in diameter, were precision cut from uniform 
density film and mounted on glass plates; the con- 
trast ratio between rings and field was 1 to 4. The 
break in each ring subtended 5 min. of visual angle 
at this viewing distance. 
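The stated visual angle of the break can be checked roughly from the ring diameter and the optical path. The sketch below assumes the conventional Landolt-ring proportion in which the break is one-fifth of the outer diameter; the text does not give the width of the break, so this proportion and the variable names are assumptions.

```python
import math

RING_DIAMETER_IN = 1.5        # ring diameter, in.
VIEWING_DISTANCE_IN = 18 * 12  # optical path of 18 ft., in.
GAP_FRACTION = 1 / 5           # assumed: break is one-fifth of the diameter

def visual_angle_arcmin(size, distance):
    """Visual angle, in minutes of arc, subtended by an object of the
    given size at the given distance (same units for both)."""
    return math.degrees(2 * math.atan(size / (2 * distance))) * 60

gap_in = RING_DIAMETER_IN * GAP_FRACTION
gap_arcmin = visual_angle_arcmin(gap_in, VIEWING_DISTANCE_IN)
print(round(gap_arcmin, 2))
```

Under the assumed proportion the result is a little under 5 min., in reasonable agreement with the value given.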

Illumination. All light sources were maintained by 
monitoring procedures at a constant level of 100 v., 
yielding a color temperature of approximately 2400° 
K. Illumination intensity was controlled by the use 
of metal diaphragms and neutral density filters at 
the sources. Three brightness levels of the near (in- 
strument-dial) field were selected in this way and 
their values checked by four observers, using a 
Macbeth illuminometer. The brightness values, 2.9, 
0.083, and 0.005 foot-lamberts, were chosen so as to 
bracket the critical 0.02 foot-lambert level and also 
to provide one relatively high illumination level. 
Similarly, five brightness values were chosen for the 
far (Landolt ring) task: 6.0, 0.076, 0.01, 0.007, and 
0.0035 foot-lamberts. 

Subjects. In Experiment I, 15 male high school 
juniors and seniors served as subjects. All of them 
were screened on the Bausch and Lomb Orthorater 
and met visual requirements equivalent to those es- 
tablished for our previous dial-reading studies (see 
8, p. 130). Thus they constituted a group with 
excellent visual acuity at near and at far, and with 
no other significant visual defects as measured by 
the Orthorater. 

In Experiment II, 12 male college students were 
used, visually screened in the same manner as the 
subjects for Experiment I. 

In Experiment III, which was a check on the re- 
sults of Experiment II, 4 subjects were used, se- 
lected from the same population as were the sub- 
jects for Experiment II. 


Results 


Experiment I. In this experiment S’s task 
was to read a bank of 12 instrument dials at 
the near (28 in.) distance at one of the three 
brightness levels, then immediately “read” 
the bank of nine Landolt rings at the far 
(18 ft.) distance at one of five brightness 
levels. Dial illumination was extinguished 
and, simultaneously, the bank of rings was 
illuminated as soon as S read the 12th dial. 
A latin-square experimental design was used 
so that each subject performed the dial-to- 
ring sequence under every combination of 
brightness pairs, and the order of presenta- 
tions was balanced for the group. 

Prior to the formal experimental sessions,
the Ss were made thoroughly familiar with
the experimental materials and procedures.
At the beginning of each of his three experimental
sessions, S was given standard instructions
(stressing speed and accuracy) and preliminary
practice, and was cone dark adapted
to the level of brightness being used for the
near (dial) task.


Table 1

Experiment I. Mean Time to Read a Bank of Nine
Landolt Rings at Each of Five Brightnesses When the
Immediately Preceding Dial-Reading Task Was at One
of Three Brightnesses *

Brightness of
Preceding               Brightness of Rings†
Dial Task†       0.0035        0.007     0.01      0.076

0.005            75.8 (N=9)    23.0      14.8      7.2
0.083            71.7 (N=7)    26.8      13.3      6.8
2.9              81.6 (N=7)    36.1      17.9      6.9

* Time in seconds. N = 15, except for column 2.
† In foot-lamberts.

At each of his three experimental sessions 
each S performed 5 dial-to-ring sequences 
under 5 of the 15 possible combinations of 
dial and task brightness. As in our previous 
studies, time and error data were recorded 
for the middle 10 of each bank of 12 dials. 
Time and error scores were also recorded for 
the Landolt-ring task but, except at the low- 
est ring brightness level, errors were almost 
nonexistent on the ring task. For this reason 
and because of the fact that error results have 
been shown to agree closely with time score 
data (8), our analysis has been based upon 
the time scores. Table 1 summarizes the re- 
sults of this experiment in terms of mean time 
to perform the second task (Landolt rings at 
monocular infinity) as a function of the 
brightness of the immediately preceding task 
(dials at 28 in. distance). 

Inspection of Table 1 shows clearly that 
time to read rings is much more closely re- 
lated to the brightness of that task than to the 
brightness of the preceding dial task. Per- 
formance curves based on the three rows of 
time scores would be highly similar through- 
out. 

An analysis of variance was not carried out
because of obvious inhomogeneity and also
because of the absence of scores from some
Ss at the 0.0035 foot-lambert level (due to
their inability to read the materials at such a
low brightness). The use of t tests between
pairs of means within columns (Table 1)
should be accompanied by certain reservations,
but may be suggestive. Significant t's
were found only for certain pairs of means in
the 0.007 and the 0.01 foot-lambert columns.
In the former case, when the brightness of 
the second task (rings) was 0.007 foot-lam- 
bert, performance was significantly poorer if 
the preceding task brightness was at 2.9 foot- 
lamberts rather than at 0.005 foot-lambert 
(36.1 vs. 23.0). However, performance was 
not significantly different whether preceding 
task brightness was at 2.9 or at 0.083 foot- 
lamberts (36.1 vs. 26.8). 

In addition, we have an inversion in the
results when the brightness of the second task
was 0.01 foot-lambert. Here, although the
performance difference between preceding task
brightnesses of 2.9 and 0.083 (17.9 vs. 13.3)
is significant, the performance difference
between the more widely separated preceding
brightnesses of 2.9 and 0.005 (17.9
vs. 14.8) is not statistically significant.
Further, the mean time to read rings following a
brightness of 0.005 foot-lambert was slightly 
greater than that for 0.083 foot-lambert (14.8 
vs. 13.3). This difference, although not sta- 
tistically significant, is in the opposite direc- 
tion to what would be predicted if dark 
adaptation were playing an effective part in 
these results. 

It seems justified to conclude that in Ex- 
periment I performance on the second task 
was essentially a function of the brightness 
of that task and was not related in any gen- 
eral way to the brightness of the first task. 

Experiment II. In this experiment the
order of the tasks was reversed. Ss were
required to read 9 Landolt rings at the far
distance, then immediately shift to the bank of
12 instrument dials at 28 in. The same 5
brightnesses for rings and 3 brightnesses for
dials that were employed in Experiment I
were also used here. The 12 Ss were used in
a latin-square experimental design and each
S served for five sessions. At each session S
performed the far task at one of the five
brightnesses (held constant for that session)
paired with all three brightnesses of the near
task. Order of presentation of far task
brightnesses was varied but not completely
balanced; order of presentation of near task
brightness was completely balanced. All
other aspects of the procedure were the same
as for Experiment I.


Table 2

Experiment II. Mean Time to Read a Bank of Dials
at Each of Three Brightnesses When the Immediately
Preceding Landolt-Ring Task Was at One of Five
Brightnesses *

Brightness of
Preceding            Brightness of Dials†
Ring Task†    0.005          0.083          2.9
0.0035        19.0 (N=10)    12.9 (N=10)    12.3 (N=10)
0.007         20.5           14.9           13.8
0.01          21.3           14.2           13.5
0.076         20.3           13.5           13.0
6.0           21.7           13.7           12.9

* Time in seconds.
N = 12 except for first row.
† In foot-lamberts.

Table 2 summarizes the results of Experi- 
ment II and shows the mean time to perform 
the second task (dial reading at 28 in.) as a 
function of the brightness of the immediately 
preceding task (rings at monocular infinity). 
Again, as in Experiment I, it will be seen 
that performance on the second task is clearly 
related to the brightness of that task but 
bears no general relation to the brightness of 
the preceding task. Nowhere in this experiment
did there result a significant difference
between means due to differing brightness of
the preceding task.

Fig. 2. Pooled data from Experiments I, II, and
III, showing visual performance as a function of
task brightness. The arrow is placed at 0.05
foot-lambert.

Experiment III. On completion of Experiment II
it was realized that the negative results
could have been due to our established
procedure of taking time and error data only 
on the middle 10 of each bank of 12 dials. 
That is, there might have been genuine time 
differences in dial performance due to bright- 
ness of the preceding ring task which were 
concentrated into the period between the 
presentation of the task and the reading of 
the first dial, and thus escaped notice in our 
scoring procedure. 

Accordingly, the present experiment was 
performed on 4 Ss selected in the same man- 
ner from the same population as those used 
in Experiment II. The experiment was a 
duplicate of Experiment II in all respects 
save that times were recorded to read all 12 
dials in each bank of dials. 

The results, summarized in Table 3, indi- 
cate a very close correspondence with those 
of Experiment II. Performance in reading 
the dials is related to the brightness of the 
dials but is unrelated to the brightness of the 
preceding ring task, even when time scores 
are based on all 12 dials. We can be confident,
then, that the results of Experiment II
were not because of a loss of crucial data due 
to recording time for only the middle 10 of 
each bank of 12 dials. 

Table 3

Experiment III. Mean Time to Read a Bank of
Dials (Recording Time for All 12 Dials) at Each of
Three Brightnesses When the Preceding Landolt-Ring
Task Was at One of Five Brightnesses *

Brightness of Preceding Ring Task†
0.0035
0.007
0.01 14.9
0.076 14.4
6.0 14.8

Brightness of Dials†
0.083
14.3
15.9

* Time in seconds.
† In foot-lamberts.


Combined results. A further analysis of
the data of the above three experiments was
undertaken in order to compare these results
with previous studies of visual performance
as a function of task brightness which have
been carried out in this laboratory (7, 8).
Since no significant trend of differences due to
brightness of preceding task was found, the
results in each experiment have been pooled
to show performance as a function of task
brightness, and are presented in Fig. 2.

Our previous studies of dial reading (8) 
and Rock’s study employing four widely dif- 
fering visual tasks (6) agreed in indicating a 
brightness of 0.02 to 0.05 foot-lambert as a 
value below which performance becomes in- 
creasingly poor, but above which brightness 
ceases to be a significant variable in such 
visual perceptual tasks. 

The arrow near the base line in Fig. 2 has 
been placed at 0.05 foot-lambert and it serves 
to point out that in all three of the present 
experiments we have corroboration of our 
previous results. On each of the three curves 
all differences for pairs of values which (a) 
lie below 0.05 foot-lambert or (b) cross this
value, are highly significant statistically. In 
no case, however, is the difference between 
the values for the two highest brightnesses on 
any curve statistically significant. 


Discussion 


In the above three experiments we have 
presented evidence that performance on a far 
visual task at low photopic brightnesses is un- 
affected (within the brightness range used) 
by the brightness of an immediately preced- 
ing visual task at near distance (28 in.), and 
similarly that performance on the near task 
is not related to the brightness of the im- 
mediately preceding far task. 

These findings should not be interpreted as 
casting doubt upon the reality of dark adapta- 
tion. They are, obviously, a function of the 
range of brightnesses which we employed, 
and are to be understood accordingly. 

Our experiments are similar to those stud- 
ies which have measured the course of dark 
adaptation as a function of the brightness 
level of the preadapting stimulus field. In 
this context we can regard our first task in 
each experiment as a preadapting brightness 
value. And then, instead of asking how long
it takes to perform a given visual task after
varying levels of preadapting brightness (as 
we did in these experiments), we might have 
phrased the question to ask how bright the 
second visual task must be in order to be per- 
formed promptly, following various levels of 
preadapting brightness. This is essentially 
the concept of the “instantaneous” threshold. 

Our preadapting brightnesses ranged from 
6.0 to 0.0035 foot-lamberts. The lowest level 
at which the second task was presented was 
0.0035 foot-lambert. Most dark-adaptation 
studies have used preadaptation brightnesses 
very much higher than ours, commonly at 
several hundreds or even thousands of foot- 
lamberts. In only a few instances have rela- 
tively low preadapting brightnesses been em- 
ployed, and even here the data are not as 
clear for our present purpose as we could 
wish. If, however, we examine the relevant 
results reported by Haig (4), Peckham (5), 
and others, two facts stand out. First, the 
instantaneous brightness threshold decreases 
rapidly with a decrease in preadapting bright- 
ness, and second, the early course of dark 
adaptation is increasingly rapid with decreas- 
ing preadapting brightness. 

If we relate our present findings to the first 
of these two facts, data from the studies cited 
above indicate that for a preadapting bright- 
ness of the order of our highest values (3 and 
6 foot-lamberts) the instantaneous threshold 
lies over a log unit below the lowest value 
which we used on our second task (0.0035 
foot-lambert). At lower levels of preadapt- 
ing brightness the instantaneous threshold 
would, of course, be still less. 

With respect to the second fact mentioned 
above, the evidence cited indicates that the 
sensitivity of the eye increases so rapidly fol- 
lowing preadapting brightnesses as low as 
ours that the threshold, after about 10 sec., 
would be from 2 to 3 log units below the low- 
est value of our second task. 

We can thus understand why performance 
on our second task was in each case so little 
affected by the brightness of the first task. 
Even after exposure to the highest first-task
brightness which we employed (6.0 foot-lamberts
in the ring-to-dial task) the instantaneous
threshold for detailed visual tasks
was sufficiently low to permit prompt beginning
of performance at the lowest brightness
used for the second task, which, it should be 
recalled, was close to but not below cone 
threshold. 

In connection with the present findings at- 
tention should be called to a recent study by 
Brown and Grether (1). They found that 
preadaptation to an instrument panel or a 
white panel, lighted by red light or low tem- 
perature white light at values of .003, .020, 
or .084 foot-lambert, produced a small but 
statistically significant increase in visual 
threshold. However, this was absolute 
threshold, measured extrafoveally by a 3° 
test patch which was viewed 7° to the left of 
the fixation point of the right eye. Thus 
their results are not directly comparable to 
those of the present study because here the 
effects of preceding task brightness were 
measured on tasks requiring detailed foveal 
vision, and at photopic brightness values. 

With respect to practical applications, our 
findings have relevance as part of the con- 
siderations which should determine recom- 
mendations for instrument dial brightness for 
night flying and other night operational tasks, 
keeping in mind that our results are relevant 
to those visual tasks requiring detail (foveal) 
vision. The brightness values which we used, 
although restricted to only a part of the total 
possible range, do cover a range of over four 
log units in the low photopic region, and at 
the upper limit they correspond to a high 
level of instrument dial brightness. 

The present results indicate that our previ- 
ous recommendations (namely, that instru- 
ment dial brightness be maintained just above 
the critical 0.02 foot-lambert level [8, 9]) 
were probably too conservative as far as de- 
tailed visual tasks are concerned. Our pres- 
ent results support the statement that a 
brightness level which is one, or possibly two, 
log units above this value will provide ade- 
quate brightness for dial reading (with a con- 
siderable factor of safety) and will probably 
produce no significant interference with those 
low brightness visual tasks outside the cock- 
pit which require detail (foveal) vision. This 
is not to assert that there would be no effect 
on absolute (rod) threshold. The results of
these experiments do not contribute to that
problem, but rather relate to the seeing of de- 
tails at low photopic values. Which is more 
important for night operations (absolute 
threshold or detail vision) should be decided 
by a thorough job analysis of the pilot’s 
visual tasks. 
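
The recommendation above is stated in log units, so a short worked conversion may help. This is only an illustrative sketch: the function names are hypothetical, and "log unit" is taken to mean a factor of ten in brightness (log10).

```python
import math


def raise_by_log_units(brightness_ftl, log_units):
    """Brightness in foot-lamberts after an increase of `log_units` powers of ten."""
    return brightness_ftl * 10 ** log_units


def log_units_between(low_ftl, high_ftl):
    """Number of log10 units separating two brightnesses."""
    return math.log10(high_ftl / low_ftl)


critical_level = 0.02  # foot-lambert level cited from the earlier recommendations

print(raise_by_log_units(critical_level, 1))  # roughly 0.2 foot-lambert
print(raise_by_log_units(critical_level, 2))  # roughly 2.0 foot-lamberts
# The brightest dial condition used here (2.9 foot-lamberts) relative to 0.02:
print(round(log_units_between(critical_level, 2.9), 2))  # 2.16
```

On this reading, one to two log units above 0.02 foot-lambert corresponds to roughly 0.2 to 2.0 foot-lamberts, and the highest dial brightness actually tested lies a little over two log units above the critical level.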


Summary 


This study was undertaken to determine 
how visual performance at low photopic 
brightness levels is affected by the brightness 
of an immediately preceding visual task. 
Two visual tasks were employed. In one 
(the near task) Ss were required to read 
banks of photographic reproductions of in- 
strument dials under instructions stressing 
speed and accuracy. The viewing distance 
was 28 in., and the three task brightnesses
were 2.9, 0.083, and 0.005 foot-lamberts. In 
the other (the far task) Ss “read” banks of 
Landolt rings under speed and accuracy in- 
structions. They were viewed periscopically 
at a visual distance of 18 ft., and the five task 
brightnesses were 6.0, 0.076, 0.01, 0.007, and 
0.0035 foot-lamberts. 

The Ss were high school and college stu- 
dents with excellent visual abilities. In 
Experiment I, Ss (N =15) were visually 
adapted to the brightness level of the near 
task, performed the near task, then immedi- 
ately were given the far task. Experiment II 
was similar except that Ss (N = 12) first 
performed the far task, then the near task. 
A third experiment was carried out as a cor- 
roborative check on Experiment II. 

Analysis of results was based primarily on 
time scores. Both the near-to-far and the
far-to-near experiments showed that, within 
the brightness ranges used, performance on a 
visual task was related to the brightness of 
that task but bore no general relation to 
the brightness of the immediately preceding 
visual task. 

Comparison of these results with our earlier 
studies suggests that the critical 0.02 foot- 
lambert level of dial brightness previously 
found for this task can safely be exceeded by 
one, and possibly two, log units of brightness 
without impairing performance of a second, 
low photopic brightness task. It was emphasized
that these findings refer to detailed
visual tasks at brightnesses at or above cone 
threshold. 


Received July 29, 1954.


Connell (1) has reviewed the work that
has been accomplished concerning the effect 
of heat upon performance. A number of 
gaps in our knowledge remain. One of these 
is the effect of temperatures actually ex- 
perienced during the summer months upon 
achievement in trade or technical schools. 
The present study is concerned with this 
problem. 


Method 


Two groups of U. S. Navy trainees entering the 
aviation electronics technician course at Memphis, 
Tennessee, were matched on the basis of a variable 
that correlated .62 and .64 with the two measures 
which were used in evaluating the achievement of 
the groups. The matching process resulted in one 
of the groups having a mean of 82.59 and a stand- 
ard deviation of 5.08 on the matching variable, while 
the other group had a mean of 82.71 and a stand- 
ard deviation of 4.92. These figures yield a critical 
ratio of .34, which, of course, does not approach 
statistical significance. The group having the slightly 
higher mean was the one that worked under the 
lower temperature condition. The period of the 
study was from June 15 through August 28, 1953. 
Approximately 80 men per week entered the course, 
resulting in a total N of 808. One group received 
instruction in an air-conditioned building while the 
other group received instruction in a building which 
was equipped with exhaust fans as cooling equip- 
ment. 
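
The critical ratio quoted for the matching variable can be checked with a short calculation. The paper does not state which formula was used; the sketch below assumes the independent-groups standard error of a difference and takes each group to contain 404 men, as reported later in the Method section. The function name is hypothetical.

```python
import math


def critical_ratio(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Absolute difference between two means divided by its standard error
    (independent-groups form, assumed here for illustration)."""
    se_diff = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return abs(mean_a - mean_b) / se_diff


# Matching-variable statistics as reported for the two groups.
cr = critical_ratio(82.71, 4.92, 404, 82.59, 5.08, 404)
print(round(cr, 2))  # 0.34
```

Under these assumptions the result agrees with the reported value of .34.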

The temperature of the spaces was recorded twice 
daily, once in the morning and once in the after- 
noon. The median afternoon effective temperatures 
in the air-conditioned building and in the other 
building were 71.3 and 82.0, respectively. The cor- 
responding quartile deviations were 2.0 and 1.9. 
Effective temperature is an empirically determined 
index of the degree of warmth perceived upon ex- 
posure to different combinations of temperature, 
humidity, and air movement. The median effective 
temperature in the morning was about two degrees 
lower than in the afternoon in the non-air-condi- 
tioned building. There was little difference between 
the morning and afternoon temperature in the air- 
conditioned building. The median dry bulb tem- 
perature in the non-air-conditioned spaces was 86.3 
in the morning and 92.5 in the afternoon. The
corresponding quartile deviations were 3.0 and
3.3, respectively.

1 The opinions expressed are those of the writer
and are not necessarily shared by the Department
of the Navy.

The instructors who participated in the study were 
first divided into two groups on the basis of teach- 
ing experience. Then within these two groups they 
were assigned to the student groups working under 
the two temperature conditions by means of a table 
of random numbers. This was done to avoid a pos- 
sible systematic bias in the quality of the instruc- 
tors assigned to the two groups. Course content and 
examinations were identical for the two groups. 
Knowledge that an experiment is being conducted 
sometimes appears to affect performance. In view 
of this fact only a minimum number of people were 
informed that a study was in progress. A question- 
naire, given after the completion of the study, indi- 
cated that less than 2% of the subjects knew that 
they were involved in an experiment. Responses to 
two additional questionnaire items are reported in 
the next section. 
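
The instructor assignment described above (grouping by experience, then assigning at random within each group) can be sketched as follows. The study used a table of random numbers; a seeded pseudo-random generator stands in for it here, and the instructor labels, stratum names, and function name are all hypothetical.

```python
import random


def assign_within_strata(strata, conditions, seed=0):
    """Shuffle each stratum and deal its members across conditions as evenly as possible."""
    rng = random.Random(seed)
    assignment = {}
    for members in strata.values():
        shuffled = list(members)
        rng.shuffle(shuffled)
        for i, person in enumerate(shuffled):
            assignment[person] = conditions[i % len(conditions)]
    return assignment


# Hypothetical roster; the paper reports no instructor names or counts.
strata = {
    "more experienced": ["A", "B", "C", "D"],
    "less experienced": ["E", "F", "G", "H"],
}
conditions = ["air-conditioned", "exhaust fans"]

print(assign_within_strata(strata, conditions))
```

Dealing from a shuffled list within each stratum keeps the two temperature conditions balanced on teaching experience while leaving the individual assignments to chance.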

The study included the first month of instruction 
of a 7-month course in electronics. One measure of 
achievement was taken following the first unit, con- 
sisting of 80 hours or two weeks of instruction. The 
achievement test covering this instruction was ad- 
ministered to the two groups, composed of 404 men 
each. A second measure was taken covering the 
second unit, also consisting of 80 hours of instruc- 
tion, or at the end of the fourth week. Three hun- 
dred twenty-two of the men in the two original 
groups took this test in addition to the test which 
covered the first unit of instruction. The reason 
that 82 men in each of the original groups of 404 
men were not included in the comparison based on 
the second unit of work was that these men were 
in the last two classes to enter the course, and the 
experimental conditions had been changed to stand- 
ard training conditions before they completed the 
second unit of work. The exclusion of these men 
from the second comparison could have no effect on 
the results, other than to reduce the N, since they
were members of complete classes, and since match- 
ing was accomplished separately within each class. 

The reliability of the two achievement tests, as 
estimated by the Kuder-Richardson formula, was 
found to be above .90. In brief, the material cov- 
ered in the two units of instruction included the fol- 
lowing: basic theory of electricity, characteristics of 
electronic currents, meters, slide rule, applied mathe- 
matics, and Morse code. 


Results 


The results of the experiment are shown in
Table 1.


Table 1

Grades Made by Matched Groups Receiving Instruction
Under Two Temperature Conditions
(N = 808 for unit 1, 644 for unit 2)

            Lower Temperature    Higher Temperature
            Mean     SD          Mean     SD         CR      P
Unit 1      75.07    9.14        74.76    8.47       …       .38
Unit 2      72.91    9.25        71.88    10.06      1.76    .08








The questionnaire, already referred to, 
asked the subjects to indicate which of five 
descriptive terms best described the tempera- 
ture in the classrooms and laboratories in 
which they received training. Eighty-six per 
cent of the men trained in air-conditioned 
spaces described the temperature as “com- 
fortable” while 10% said that it was “a little 
too cool.” The men who received training in
non-air-conditioned spaces responded quite 
differently, 74% stating that the temperature 
was “uncomfortably hot” and 24% saying 
that it was “a little too warm.” The remain- 
ing descriptive term, which has not been men- 
tioned, was “uncomfortably cold.” Less than 
1% of either group used this category. 

The second item on the questionnaire asked 
whether the subjects thought their learning 
was affected adversely by the temperature. 
Ninety-three per cent of the men in the group 
trained in the air-conditioned spaces did not 
think so, while 79% of the men who worked 
under the higher temperature condition, ap- 
parently erroneously, thought that their learn- 
ing had been affected adversely. 


Discussion 


Connell (1) arrived at a cautious, qualified 
generalization concerning the maximum tem- 
perature at which simple sedentary tasks may 
be undertaken without serious impairment. 
The critical temperature suggested by Con- 
nell was 85° F. effective temperature. At 
temperatures two or three degrees higher than 
85° deterioration has been demonstrated on a 
variety of tasks. This generalization concern- 
ing critical temperature was based primarily 
upon the work of Mackworth (3, 4), Viteles
and Smith (5), Weiner and Hutchinson (6),
and Forlano, Barmack, and Coakley (2).
In the present study, on three-fourths of the
days the temperature was in the eighties in 
the building that did not have air condi- 
tioning, but on only three days did the tem- 
perature reach 85° F. effective temperature. 
Therefore the finding of no significant differ- 
ence between the two groups is considered to 
be in accord with the above generalization. 
However, the finding was at variance with 
the opinions expressed by a large majority of 
the men who attended classes under the higher 
temperature conditions, as well as with opin- 
ions expressed informally by other persons 
connected with the training program. 

The practical implication of the study ap- 
pears to be that if one is interested only in 
the effectiveness of training, as opposed to 
comfort, reduced labor turnover, and other 
advantages sometimes attributed to air con- 
ditioning, essentially the same performance 
can be expected of U. S. Navy trainees, en- 
gaged in sedentary activities, under effective 
temperatures up to at least the low eighties. 
Stated geographically instead of in terms of 
effective temperature, one might reasonably 
expect no appreciable decrement in training 
effectiveness of U. S. Navy trainees, engaged 
in sedentary tasks, due to the summer heat in 
classrooms and laboratories in most areas of 
the United States lying at latitudes as high as 
that of Memphis, Tennessee, provided there 
is good movement of air. An interesting
question arises as to whether one would be 
justified in extrapolating these two statements 
to include industrial and other civilian tech- 
nical training. Perhaps the most helpful in- 
formation on this subject concerns the char- 
acteristics of the sample. There are, of 
course, many variables on which a sample 
may be described, and there is the danger 
that the ones singled out may not be the 
most appropriate. However, it appears espe- 
cially pertinent to mention the following 
facts. The sample did not consist of “hard 
core” Navy men, but instead, virtually all of 
the men had been in the Navy less than six 
months when they entered the aviation elec- 
tronics school. Moreover, only a small pro- 
portion of the men are expected to re-enlist 
at the end of their first enlistment.
Approximately 14% of the men had had some
college work; an additional 65% were high
school graduates, and 21% had not completed
high school. The mean age of the
group was slightly over 20, with little vari- 
ability above and below the mean. 

In the absence of an adequate measure of 
motivation it is difficult to estimate the ex- 
tent to which the effect of the summer heat 
may have been overcome by determination 
and extra effort. We do not know whether 
the motivation of the men in our sample was 
essentially the same as that of civilian groups 
undergoing training in technical or trade 
schools, a point which would limit the gen- 
erality of the results, but no evidence is avail- 
able as to whether or not it does. In any 
event, no claim is made that the results of the 
present study apply to industrial and other 
civilian technical training, but the proposi- 
tion that the same is true in these situations 
as in the military would be an attractive hy- 
pothesis for experimental investigation. 


Summary 


This study investigated the effect upon 
technical training of the summer temperature
occurring at a U. S. Navy training center
in the mid-south. The study involved two 
equated groups of 404 men each which re- 
ceived virtually identical treatment except 
that one of the groups attended classes in air- 
conditioned spaces and the other group did 
not. The latter group utilized exhaust fans. 
Less than 2% of the subjects knew that they 
were involved in an experiment. 


George Douglas May 


The subjects working under the higher tem- 
perature conditions thought that their learn- 
ing was impaired, but no significant differ- 
ence was found between their achievement 
and that of the other group. It was con- 
cluded that temperatures at least as high as 
the low eighties Fahrenheit effective tempera- 
ture do not result in an appreciable decre- 
ment in achievement in U. S. Navy technical 
school courses of a sedentary nature. 


Received August 20, 1954.


Prior to their appointment, all production
supervisors at a California aircraft corporation
are given the Kuder Preference Record
and the California Test of Mental Maturity
as part of a selection and advancement
program.¹ The firm also maintains objective
records on the quality and quantity of pro- 
duction in its various departments. This ar- 
ticle will be concerned with the relationships 
between certain Kuder and CTMM scales 
and industrial criteria for the supervisors in 
29 production departments. A description of 
the three criteria used in validating the test 
scores will be given first, followed by a brief 
discussion of the analysis and results. 


Criteria 


Work-rework. Each department is charged with 
a certain number of production hours in a given pe- 
riod. Work turned out by the shop occasionally 
must be returned for reworking because of defi- 
ciencies determined by inspection. The amount of 
time spent on this reworking is charged as rework 
time. For each shop, for a period of approximately 
six months, a ratio of production hours to the re- 
work hours was determined. Since these data were 
available by weeks, odd and even week totals of the 
ratios were obtained for each department to allow 
the computation of reliability estimates. Correcting 
for double length, the reliability estimate was .95. 
The work-rework ratios were ranked to provide the 
final criterion used. 
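
"Correcting for double length" is conventionally done with the Spearman-Brown formula, although the paper does not name it, so its use here is an assumption. The sketch below shows the correction and, working backward, the odd-even correlation that a corrected estimate of .95 would imply. Function names are hypothetical.

```python
def spearman_brown(r_half):
    """Step a split-half (odd-even) correlation up to the reliability of the full-length measure."""
    return 2 * r_half / (1 + r_half)


def implied_half_correlation(r_full):
    """Invert the Spearman-Brown formula to recover the split-half correlation."""
    return r_full / (2 - r_full)


r_half = implied_half_correlation(0.95)
print(round(r_half, 3))                  # 0.905
print(round(spearman_brown(r_half), 2))  # 0.95
```

Under this assumption, the reported reliability of .95 corresponds to a correlation of about .905 between the odd-week and even-week totals.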

Acceptance rate. In fabrication departments, jobs
which are found by inspection not to be completed
are returned to the department for further work.
These are called “incomplete” jobs. In assembly
shops the term for a comparable error is a “squawk”
job. Where a job cannot be easily corrected and
must be handled at the next higher administrative
level, it is called an “inspection” job. Data on the
number of such defectives in relation to the number
of jobs accepted were available for 28 departments
over the same six-month period. For each
department, a total of “incomplete” or “squawk”
plus “inspection” jobs per 100 accepted jobs was
determined. The difference between this figure and
100 constituted the number accepted per 100, the
measure ranked to provide the acceptance-rate
criterion. Using the same procedure as before, the
reliability was estimated to be .99.

1 This research was carried out under Contract
N6-onr-23815 between the University of Southern
California and the Office of Naval Research. The
opinions expressed are our own and are not
necessarily shared by the Office of Naval Research. The
project was directed by J. M. Pfiffner, with J. P.
Guilford and H. J. Locke as associate responsible
investigators. A. L. Comrey served as psychological
consultant and Wallace S. High acted as a research
psychologist. Statistical aid was provided by
Lisbeth L. Goldberg.

Production. For most of the work accomplished 
by each department, time standards have been es- 
tablished. Thus, for a particular job, 100 hours 
may be allotted and 125 hours actually may have 
been used. Standard hours allowed do not take into 
account such matters as practice effect, shortages, 
unavoidable delays, and other variables which have 
a considerable effect. The company therefore com- 
putes a factor to be multiplied by the standard hours 
which is designed to allow for these extraneous in- 
fluences. When the standard hours are corrected by 
this factor, they are called “allowed” hours. Ratios 
of allowed to actual hours were obtainable for 27 
departments over a 13-week period. Since definite 
differences were believed to exist between groups of 
departments collected into “burden centers,” depart- 
ment production scores were taken as deviations 
from the means of their respective burden centers. 
These deviation scores were ranked to provide the 
production criterion. No reliability estimate was 
available for these adjusted production data, but 
ratios of standard to actual hours yielded an odd- 
even reliability estimate of .99. 

All the reliability coefficients reported for these 
criteria are undoubtedly spuriously high. This is 
due to the fact that each department has a rather 
characteristic level which reflects the nature of its 
work per se in addition to its competence in per- 
forming that work. The resulting split-half reli- 
ability coefficient can be likened to a correlation be- 
tween alternate forms for a given person where each 
pair of scores is for a different test with a different 
standard deviation and mean. 


Results and Discussion 
Certain Kuder and CTMM scores for 227
supervisors were available in 29 departments.
The number of supervisors per department
varied from two to 19, with a median of
eight. For the production criterion correlations,
only 27 departments and 214 cases
were involved; 28 departments and 217 cases
were used for the correlations with the acceptance-rate
criterion. The smaller number
of cases in these instances was necessary
because the criterion data could not be obtained
for the missing departments.

Since the criterion data were available by 
department, all test scores were reduced to 
department averages so that correlations 
could be computed, using the department as 
a unit. Thus, the test variables involved in 
this study were validated with organizational 
units rather than with individuals. The pro- 
cedure, then, consisted of the computation of 
an average for each variable over the super- 
visors in each department. These averages 
in turn were ranked separately for each vari- 
able. The correlations themselves were rank- 
order correlations between ranked average 
test scores and ranked criterion scores, using 
the department as the unit rather than the 
individual. The correlations, ranging from 
-.42 to .30, are presented in Table 1. An
inspection of the regressions shows no evi- 
dence of curvilinear trends between any test 
variable and criterion measure. 
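
The procedure described above (average within departments, rank the averages, correlate the ranks) might be sketched as follows. The department labels, test scores, and criterion values are hypothetical, and the function names are introduced here for illustration only; ties are given average ranks, which is an assumption, since the text does not say how ties were handled.

```python
from collections import defaultdict

def department_means(records):
    """records: iterable of (department, score) pairs for individual supervisors."""
    sums = defaultdict(lambda: [0.0, 0])
    for dept, score in records:
        sums[dept][0] += score
        sums[dept][1] += 1
    return {d: total / n for d, (total, n) in sums.items()}

def ranks(values):
    """Ranks with 1 = smallest; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def rank_order_correlation(x, y):
    """Rank-order correlation computed as the Pearson r of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical supervisor scores and department-level criterion values.
records = [("A", 110), ("A", 120), ("B", 100), ("C", 130), ("C", 125)]
criterion = {"A": 2.0, "B": 1.0, "C": 3.0}
dept_means = department_means(records)
depts = sorted(dept_means)
rho = rank_order_correlation([dept_means[d] for d in depts],
                             [criterion[d] for d in depts])
```

The unit of analysis in this sketch is the department, so the number of cases entering the correlation is the number of departments, not the number of supervisors.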

Using a formula given by Guilford (1), the 
standard error of a zero rank-order correla- 
tion with 29 cases is .19. If the normal curve 
is used as the sampling distribution, a rank 
correlation significantly different from zero at 
the 5% level of confidence would be .39. 
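
As a rough check on the standard error quoted above, the common large-sample approximation for a zero rank-order correlation, 1/sqrt(N - 1), can be evaluated for 29 cases. That this approximation matches the formula taken from Guilford is an assumption; the text does not reproduce the formula itself.

```python
import math

def se_zero_rank_correlation(n_cases):
    """Approximate standard error of a rank-order correlation when the
    population value is zero (assumed form: 1 / sqrt(N - 1))."""
    if n_cases < 2:
        raise ValueError("at least two cases are required")
    return 1.0 / math.sqrt(n_cases - 1)

se = se_zero_rank_correlation(29)
se_rounded = round(se, 2)
```

With 29 departments the approximation rounds to .19, in agreement with the figure in the text.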


Table 1 


Rank Correlations Between Test Scores and Criteria 








Work- 
Rework 


Acceptance Produc- 


Test Variable Rate tion 


1. Ability (CTMM) 
a. Language 
b. Nonlanguage 
c. Total score 
2. Interest (Kuder) 
. Mechanical 





—.17 
—.21 
—.16 


—.07 
— .06 
— .05 


—.42* — .09 
. Computational —.03 ; .27 
. Science —.10 P 08 
. Persuasive .02 j 05 
. Social service —.11 : 

. Clerical —.09 ; 30 





* Significant at the 5% level of confidence. 


Andrew L. Comrey and Wallace S. High 


Table 2

Test Score Means and Standard Deviations

                               Means               Standard Deviations
Test Variables          Supervisors  Adult Men   Supervisors  Adult Men

1. Ability (CTMM)
   Language                119.1       100.0        14.3        16.0
   Nonlanguage             114.2       100.0        17.8        16.0
   Total score             117.7       100.0        12.8        16.0
2. Interest (Kuder)
   Mechanical               93.8        78.6        11.3        22.8
   Computational            57.0        35.3        17.7        10.6
   Science                  79.8        64.0        19.3        15.5
   Persuasive               60.0        74.4        30.8        20.6
   Social service           54.5        73.7        27.2        17.5
   Clerical                 22.3        52.1        17.8        13.5






This estimate is probably somewhat too high 
for this situation because the department 
scores were derived by averaging scores for 
several supervisors in each department. Only 
one correlation out of 27 in Table 1 exceeds
.39, a result which might be expected by
chance. Furthermore, contrary to expecta- 
tion, this particular correlation showed a 
negative relationship. The only conclusion 
which can be drawn here is that the table of 
correlations presented does not warrant rejec- 
tion of the null hypothesis. 

These results, perhaps, do not constitute a 
fair indication of the possible validity of such 
test scales. They merely show that under 
present conditions the test variables are not 
valid. An inspection of Table 2 shows that 
restriction in range is present for the CTMM 
score, but correction for this condition would 
not alter the over-all validity picture here. 
A rather serious restriction in range is evi- 
dent for the Kuder Mechanical score, but in 
this case the correlation was negative. It 
would seem, therefore, that restriction in 
range is not responsible for the lack of va- 
lidity for these test scores in the situation in- 
vestigated. 
The Problem


That indices of mental ability alone do not 
predict scholastic achievement adequately is 
a well-known fact. It would seem a reason- 
able hypothesis, then, that nonintellective in- 
fluences, notably motivation or interest, are 
operative. This being the case, consideration 
of a measure of interest along with the more 
customary employment of some measure of 
academic aptitude could be hoped to enhance 
prediction of scholastic success. Further, se- 
lection only on the basis of ability to com- 
plete the curriculum is a shortsighted goal for 
administrators, especially in professional cur- 
ricula such as veterinary medicine. They 
might well be equally concerned about choos- 
ing those students who will be compatible 
with the requirements of veterinary medicine 
upon graduation and subsequent entry into 
actual practice. A high degree of interest in 
veterinary medicine is probably necessary for 
satisfaction and stability in the vocation. 
Thus the utilization of a measure of interest 
would be an aid in vocational counseling even 
though it may prove to be statistically unre- 
lated to scholastic success. 

Investigation of the stability of interests is 
also of importance to the fields of student se- 
lection and counseling, for if interests tend to 
undergo major changes with training and/or 
the passage of time, selection or counseling 
based upon such measures can have little or 
no validity. The present study was under- 
taken to fulfill three objectives: first, to de- 
termine the effectiveness of the veterinary 
scale of the Strong Vocational Interest Blank 
in the prediction of academic success in a 
curriculum of veterinary medicine; second, to 
determine the effect of training upon the in- 
terest scores; third, to investigate other pos- 
sible correlates with academic success in vet- 
erinary medicine. 


Method 


Criterion group 


The Strong Vocational Interest Blank was ad- 
ministered to 61 men who were freshmen in the vet- 
erinary division of Iowa State College in the fall of 
1949, and the blank was then scored for the vet- 
erinary scale. Other data available on this group 
included centile score on the American Council on 
Education Psychological Examination and preveterinary
grade-point averages.


Criteria of Success 


The measure of success employed was cumulative 
grade-point average in the veterinary course over a 
four-year period, which was thought to be a more 
valid index than is usual in other curricula since the 
students in this course follow a rather rigidly pre- 
scribed course of study. As a result, most of the 
subjects were taking the same course from the same 
instructor at the same time which would tend to 
place all students under a uniform competitive 
burden. 

The indices of interest score, preveterinary grade- 
point average, and ACE score were correlated with 
the criteria by the Pearson product-moment method. 


Comparison Group 


It became apparent during the early phases of the 
investigation that the correlation between veterinary 
interest scores and scholarship in the veterinary 
group was quite low; this could be ascribed to the 
homogeneity of interest within this group. That 
there was homogeneity is illustrated by the fact that 
the lowest-letter veterinary-interest rating in the 
group was B-, and only two of the 61 subjects 
held ratings this low. Another plausible assumption 
would be that there is no relationship between aca- 
demic success and interest scores. 

In order to investigate the first assumption, it was 
decided to compare the interests of the veterinary 
group with those of a nonveterinary group. In 
order to do this, a random sample of 50 males from 
other divisions of the college was drawn from the 
files of the Iowa State College Testing Bureau. 
Grade-point average, veterinary score on the Strong 
Blank, and ACE were also available for the mem- 
bers comprising this sample. 

Many members of the criterion group were vet- 
erans, which tended to make them older than the 
average student group, and since age may be related 
to interest, an attempt was made to select appro- 
priately older men for the comparison group. 







T. E. Hannum and John B. Thrall 


Table 1

Comparative Data on Veterinary and Nonveterinary Group

                   Mean Veterinary        Mean            Mean
                   Interest Score**       Grade-Point     ACE
Group              1949       1953        Average         Score*

Veterinary         49.62      49.00       2.82†           61.19
Nonveterinary      27.72      --          2.36            60.49

* Centiles.
** Standard scores.
† Preveterinary training only.


Members of the comparison group were distributed 
nearly evenly in the various divisions of the college 
with 17 in engineering, 16 in science, and 15 in ag- 
riculture. 

All members of the criterion group had had one or
more years of preveterinary schooling. Accordingly,
the vast majority of the comparison group was selected
on the basis of having completed at least one
year of college.

Discriminant function analysis was employed in 
comparing the criterion and the nonveterinary
groups. In the initial analysis the three variables— 
interest score on the veterinary scale, ACE score, 
and preveterinary grade-point average—were in- 
cluded. In subsequent analyses the variables were 
paired in all possible combinations. 

In order to test the efficiency of prediction af- 
forded by the various combinations of variables, the 
weighting factors derived were applied to the three 
principal indices for each member of the criterion 
group. It was thus possible to determine the percentage
of correct classification which each combination
of variables would yield when members of this
group were treated as applicants. However, while 
an idea of the relative effectiveness of each com- 
bination may be gained through this technique, it 
must be recognized that its application to a new 
group would reduce the accuracy of prediction. 
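
A two-variable version of such an analysis might be sketched as below. The sketch assumes Fisher's linear discriminant with a pooled within-group covariance matrix and a cutoff midway between the two group means; the text does not state the computational form actually used, and every data value and function name here is hypothetical.

```python
def mean_vec(rows):
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(2)]

def pooled_cov(a, b):
    """Pooled within-group 2x2 covariance matrix of groups a and b."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows in (a, b):
        m = mean_vec(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    dof = len(a) + len(b) - 2
    return [[s[i][j] / dof for j in range(2)] for i in range(2)]

def score(w, r):
    """Discriminant score: weighted sum of the two predictor values."""
    return w[0] * r[0] + w[1] * r[1]

def discriminant(a, b):
    """Weights w = S^-1 (mean_a - mean_b) and a cutoff halfway between
    the discriminant scores of the two group means."""
    ma, mb = mean_vec(a), mean_vec(b)
    s = pooled_cov(a, b)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    w = [(s[1][1] * diff[0] - s[0][1] * diff[1]) / det,
         (-s[1][0] * diff[0] + s[0][0] * diff[1]) / det]
    cutoff = (score(w, ma) + score(w, mb)) / 2
    return w, cutoff

# Hypothetical (interest standard score, grade-point average) pairs.
vet = [(50, 2.9), (48, 2.7), (52, 3.0), (49, 2.6)]
other = [(28, 2.3), (30, 2.5), (25, 2.2), (27, 2.4)]
w, cutoff = discriminant(vet, other)
pct_accepted = 100.0 * sum(score(w, r) > cutoff for r in vet) / len(vet)
```

Because the weights are derived from and then applied to the same cases, the percentage obtained this way is optimistic, which is the caution raised in the paragraph above.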


Stability of Interest 


The original group graduated in 1953, and of those 
still in school, 39 were retested on the Strong Blank 
at that time. 

A comparison of the results of this test with 
scores obtained when the subjects were freshmen was 
made to determine the effects of the training pro- 
gram upon veterinary interest. 

To ascertain the effects of level of academic 
achievement upon interest score, the groups were 
divided by quartiles, and the differences in the 
means on each of the two administrations were 
compared by t tests. In addition, a correlation between
four-year grade-point average and score on
the 1953 testing was computed by the Pearson 
product-moment method. Finally, the direction of 
interest change was correlated with membership in
the upper and lower halves of the grade distribution
by the phi coefficient. 
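
For a fourfold table of this kind (upper or lower half of the grade distribution against increase or decrease in interest score), the phi coefficient can be computed as sketched below. The cell counts in the example are hypothetical and are not the study's data.

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi for the 2x2 table [[a, b], [c, d]]:
    (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d))."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        raise ValueError("a row or column total is zero")
    return (a * d - b * c) / denom

# Hypothetical counts: rows = upper / lower half of the class,
# columns = interest score increased / decreased.
phi = phi_coefficient(11, 9, 9, 10)
```

A value near zero, as in this made-up table, would indicate that direction of change is unrelated to class standing.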

A Pearson product-moment correlation was com- 
puted for the two Strong administrations to test for 
reliability of the instrument and change in the group 
as a whole. 


Results 


Relationship Between the Variable and Scho- 
lastic Achievement 


The correlation between veterinary achieve- 
ment as measured by cumulative four-year 
grade-point average and interest scores on the 
Strong veterinary scale was .06. This is 
much lower than the correlation of .30 re- 
ported by Layton (2) between veterinary in- 
terest scores and freshman grades in veteri- 
nary school at the University of Minnesota. 
However, Layton’s group was unusual in that 
they were enrolled in a newly inaugurated 
program. Since the School of Veterinary 
Medicine at Iowa State College is long estab- 
lished and considerable effort is spent in 
screening applicants, it is felt that a more 
select group was enrolled. 

To account for the extremely low relation- 
ship between interest and scholastic achieve- 
ment, we would suggest the hypothesis that 
those students without a minimal amount 
of veterinary interest do not enroll in veteri- 
nary school. Like other professional student 
groups, veterinary students are very homo- 
geneous in terms of interest. Owing to this 
great similarity of veterinary interest, it 
seems likely that differences in other mo- 
tives such as social approval and competition 
for good grades become more important in 
their effect on achievement. Further, it seems 
apparent that long-established habits of work 





Strong Vocational Interest Blank 


are more influential than any other single fac- 
tor. This is evidenced by the high correla- 
tions customarily found between academic 
achievement and previous academic achieve- 
ment. In this study the correlation between 
preveterinary grade-point average and four- 
year grade-point average in veterinary school 
was .43, statistically significant at the .01 
level of confidence. 

A correlation of .33 was found between 
centile score on the ACE and four-year grade- 
point average in veterinary school. This cor- 
relation is statistically significant at the .01 
level of confidence, and is higher than that 
reported by Owens (3) of .02 between raw 
ACE scores and freshman veterinary grades 
for 111 men. 


Prediction Based on Comparison with the 
Nonveterinary Group 


The results of the discriminant function 
analysis employing all three variables are indicated
in Table 2.

The data included in Table 2 indicate that 
an equation employing all three variables 
proved most effective in this case, correctly 
accepting 94% of the veterinary student 
group included in this study. A test of sig- 
nificance of the contribution of the various 
factors indicated that the contribution of the 
Strong and the GPA was statistically signifi- 
cant beyond the 1% level. The ACE con- 
tribution was not statistically significant; 
therefore it appears that the Strong and GPA
would select as satisfactorily as all three fac- 
tors combined. As previously mentioned, the 
application of these factors to a new group 
of applicants to the veterinary division would 
probably reduce the accuracy of prediction. 
A further study on this subject with the next 
beginning class is anticipated. 


Stability of Interest 


A test-retest correlation of .68 was ob- 
tained by correlating the 1949 and 1953 in- 
terest scores of the criterion group. This cor- 
relation was statistically significant at the .01 
level of confidence and compares favorably 
with the odd-even reliability coefficient of .72 
reported by Hannum (1) at the time of 
standardization of the veterinary scale on the
Strong. 

A statistically nonsignificant correlation of 
.06 between four-year grade-point average 
and the interest score on the 1953 test con- 
firmed that there is no general relationship 
between scholastic success and veterinary in- 
terest score in an already selected group. 

A t-test comparison of the test-retest
scores by scholarship groupings indicated that
the level of academic achievement had no differential
effect upon the amount of change in
interest score.

A phi coefficient of .10 indicated that mem- 
bership in the upper or lower half of the 
graduating class was not related to the direc- 
tion of change in interest score. 


Table 2 


Combinations of Variables and Their Predictive Efficiency 











Strong, 
Weighting Factors ACE, GPA* 


Combinations of Variables 


Strong, 


Strong 
ACE 





02348585 
00013936 
00210579 


GPA 
ACE 
Strong 


Critical V score for acceptance 15083400 


Application to veterinary group 
% correctly accepted 94 


.00030739 -- 
00223358 


10785199 


ACE, 
GPA alone 
.02437129 
.00005200 —** 

-00199098 — 


— 02550906 


14491522 .0598 1385 





* Strong score in standard score form, ACE score in centiles. 


** Not derived for single factor. 
+ Standard score. 







Summary 


The following conclusions were drawn from 
the results of this study: 


1. Within a group of veterinary students 
there is no significant relationship between 
measured interest in veterinary medicine and 
academic achievement in a veterinary medi- 
cine curriculum. 

2. When veterinary students are compared 
with students in other curricula it is possible 
to derive indices which predict curricular 
membership in veterinary medicine with a 
high degree of accuracy. 

3. Prediction of curricular membership in 
veterinary medicine is best made by a com- 
bined consideration of veterinary interest 
score, preveterinary grade-point average, and 
ACE score. However, the contribution made 
by the addition of the ACE would not war- 
rant the administration of the test for pre- 
dictive purposes. 


4. Academic training in veterinary medi- 
cine does not significantly affect measured in- 
terest in veterinary medicine. Neither does 
level of achievement in the veterinary medi- 
cine curriculum affect degree or direction of 
change in measured interest. 

5. The test-retest correlation of the veteri- 
nary scale of the Strong Vocational Interest 
Blank over a four-year period is acceptably 
high, being .68. 
High School Tests and Measurements as Predictors of
Occupational Status 
This study considers the relative validity
of all available high school tests and meas- 
urements as predictors of later occupational 
status for 97 male subjects. 


Method 


Population. In 1943, a group of 111 male stu- 
dents graduated from a Flint, Michigan, public high 
school. These students, initially selected at random 
from the high school population, were chosen for 
the present study because they had taken a variety 
of tests in the course of a guidance experiment. In 
1953 a survey was completed which ascertained the 
occupational status of 97 members of the original 
group of 111 subjects. (At the time that the sur- 
vey was conducted there were 107 living male gradu- 
ates.) 

The high school drew students from neighborhoods 
of homes and cultural standards which ranged from 
upper lower to upper middle class. When the sub- 
jects were in high school, most of their parents were 
employed at various levels of responsibility in Flint 
automobile plants. 

Instruments. Tests administered in the 9th grade 
were the Kuhlmann-Anderson Intelligence Test, 
Stanford Achievement Test, Detroit Mechanical 
Aptitude Test, Minnesota Paper Form Board, and 


the student form of the Bell Adjustment Inventory. 
Three of these tests were given again in the 12th 
grade. All of the tests, as well as the high school 
grade-point average, were correlated with job status 
in an attempt to determine their predictive value. 

A 5-point scale, originally known as the Taussig- 
Terman classification of the father’s occupation, was 
used as the criterion measure. On this scale, the sub- 
jects gave their jobs one of the following numerical 
ratings: 


1. Unskilled
2. Semiskilled
3. Skilled
4. Business and Managerial
5. Professional and High Executive


Rating procedure. In the follow-up survey, each 
subject was asked to rate his job according to the 
5-point system of classification. He was also asked 
to give the title of his job and a brief description of 
his duties. The subject’s own rating of his job was 
checked against the additional information given, 
and, where necessary, the rating was changed to 
conform to the descriptive data. Virtually all of 
the men rated their jobs appropriately. Personal 
interviews and telephone conversations with a ma- 
jority of the subjects further confirmed the accuracy 
of their ratings. 


Table 1

Criterion Mean and Standard Deviation, and Predictor Means, Standard Deviations, Coefficients of
Correlation, and Probability Levels (97 Male High-School Graduates)

                                                     r with
                                   Mean       SD     criterion     p

Criterion:
  Occupational Status              3.35      1.16

Predictors:
  High-School Grade Point Av.*     2.00       .49       .38      <.01
  Kuhlmann-Anderson, 9th         101.59     12.88       .28      <.01
  Kuhlmann-Anderson, 12th        106.10     16.92       .25      <.02
  Stanford Achievement, 9th       99.78     12.55       .23      <.05
  Detroit Mechanical, 9th        116.97     20.49       .23      <.05
  Minn. Paper Form Board, 9th     35.28     10.03       .21      <.05
  Minn. Paper Form Board, 12th    43.85     10.53       .21      <.05
  Bell Adjustment Inv., 9th†      34.00     15.35      -.24      <.02
  Bell Adjustment Inv., 12th      28.80     16.84      -.06      >.10

* In the Flint Public Schools the following values are assigned letter grades: A = 4, B = 3, C = 2, D = 1, E = 0.
† On the Bell Inventory high scores indicate poor adjustment, and low scores, good adjustment.
Results


In Table 1 the mean and standard devia- 
tion of the criterion and the means, standard 
deviations, correlations with criterion, and 
levels of confidence of the predictors are 
shown. As can be observed, the predictor 
which correlated highest with occupational 
status was the high school grade-point aver- 
age (r = .38). Lower correlations were ob- 
tained for the 9th and 12th grade adminis- 
trations of the Kuhlmann-Anderson (r = .28, 
.25), 9th grade Stanford Achievement (r =
.23), 9th grade Detroit Mechanical (r = .23), 
9th and 12th grade Minnesota Paper Form 
Board (r = .21, .21), and 9th grade Bell Inventory
(r = -.24). The lowest correlation
was obtained for the 12th grade Bell Inventory
(r = -.06). Two predictors, the high
school grade-point average and the 9th grade 
Kuhlmann-Anderson, had p’s at better than 
the 1% level. 

Table 2 presents an intercorrelation matrix 
for all combinations of the independent vari- 
ables. Test-retest correlations were high for 
the Kuhlmann-Anderson (r = .84) and the 
Minnesota Paper Form Board (r = .80), but 
this was not true for the Bell Inventory (r = 
.41). Other high test intercorrelations were
those for 9th grade Kuhlmann-Anderson and 
9th grade Stanford Achievement (r = .75), 
and for 12th grade Kuhlmann-Anderson and 
9th grade Stanford Achievement (r = .70). 
Intercorrelations between the Bell Inventory, 


Cantoni 


both 9th and 12th grade administrations, and 
the remaining variables in the matrix were 
not significantly greater than zero. 

An effort was made to establish two bat- 
teries predictive of occupational status, the 
first of which would be the most valid bat- 
tery selected from 9th grade tests, and the 
second the most valid selected from all the 
independent variables. These multiple cor- 
relations were calculated according to the 
Dwyer test selection procedure and square- 
root method (1). For the 9th grade, the bat- 
tery which best predicted the criterion con- 
sisted of the Kuhlmann-Anderson and the 
Bell Inventory. This battery had an R of 
.355, with p between the 1% and 5% levels.
In selecting a battery from all the independ- 
ent variables, it was again found that two 
measures most efficiently predicted occupa- 
tional status. These were the high school 
grade-point average and the 9th grade Bell 
Inventory; here the R was .457, a value sig- 
nificant at better than the 1% level. 
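
For a battery of two predictors, the multiple correlation can be written directly in terms of the two validities and the intercorrelation of the predictors. The sketch below uses that standard two-predictor formula, not the Dwyer selection procedure or the square-root method named in the text. The validities .38 and -.24 are those reported for grade-point average and the 9th grade Bell Inventory; the intercorrelation of zero is a hypothetical placeholder.

```python
import math

def multiple_r(r1, r2, r12):
    """Multiple correlation of a criterion with two predictors.

    r1, r2 : correlations of each predictor with the criterion
    r12    : correlation between the two predictors
    """
    if abs(r12) >= 1:
        raise ValueError("predictors must not be perfectly correlated")
    r_squared = (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)
    return math.sqrt(r_squared)

# Reported validities with a hypothetical predictor intercorrelation of zero.
r_battery = multiple_r(0.38, -0.24, 0.0)
```

With uncorrelated predictors the result is simply the square root of the summed squared validities, about .45 here; a nonzero intercorrelation would shift the value somewhat.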


Summary 


1. Tests and measurements available on 97 
male high school graduates were correlated 
with later occupational status to determine 
their value as predictors. The implied as- 
sumption of this study was that, at the time 
of the follow-up survey, the subjects had had 
sufficient time and opportunity to establish 
themselves in representative jobs. 


Table 2 


Intercorrelations of Certain High School Tests and Measurements (97 Male Subjects) 








Stan. 
Ach. 
9th 


ar 
75% 
70** 


K-A 
9th 


(37** 


K-A 
12th 
42** 
84** 


HSGPA 








HSGPA 

K-A, 9th 

K-A, 12th 

Stan. Ach., 9th 
Det. Mech., 9th 
Minn. PFB, 9th 
Minn. PFB, 12th 
Bell. Inv., 9th 
Bell Inv., 12th 


Bell 
Inv. 
12th 


Bell 
Inv. 
9th 


Det. 
Mech. 
9th 


22* 
36** 
43** 
22* 


Minn. 
PFB 
9th 


—_ 
37** 
45** 
_ 
.44** 


Minn. 
PFB 
12th 


a 
.38** 
46** 
32°° 
44** 
.80** 





.03 
—.07 
—.13 
—.15 
— .03 
—.13 
—.14 





* Significant at the 5% level. 
** Significant at the 1% level. 







2. The best single 9th grade predictor was 
the Kuhlmann-Anderson Intelligence Test.
This scale had an r of .28 with occupational 
status. 

3. Of tests administered in the 9th grade, 
the most efficient battery included the Kuhl- 
mann-Anderson Intelligence Test and the Bell 
Adjustment Inventory. For this battery, R 
with the criterion was .355. 

4. The high school grade-point average 
was the best single predictor among the vari- 
ous high school measures. Here r with the 
criterion was .38. 
5. A battery selected from all available
high school tests and measurements proved 
the most efficient predictor of occupational 
status. This battery, comprised of the high 
school grade-point average and the 9th grade 
administration of the Bell Adjustment Inven- 
tory, had an R of .457. 
Measurement *
There are two types of items most widely
used in interest measurement. With one type 
a subject is presented with a list of items 
(activities, occupations, articles, people, etc.) 
grouped in triads. He is instructed to select 
from each triad the item he “likes most” and 
the item he “likes least.” This type of item 
is called “forced-choice” or “preference;” 
and, although the triadic form is not an es- 
sential aspect of the forced-choice item, it is 
the form used in the Kuder Preference Record 
and in the Minnesota Vocational Interest In- 
ventory. Pairs (5, 6), tetrads (10), and 
pentads (6) have also been used in various
kinds of psychological measurement. 

With the second type of item a subject is 
presented with a list of items to each of which 
he responds “like,” “indifferent,” or “dislike.” 
Because of these three possible responses this 
type of item is known as the L-I-D form. 


Most of the items in the Strong Vocational
Interest Blank are of this or of the similar
“Yes-?-No” type. Many personality inven- 
tories also are made up of the latter kind of 
item. 

Clark, who developed the Minnesota Vo- 
cational Interest Inventory especially for use 
at the skilled trades level, chose the forced- 
choice triad form of item “on the basis of 
rather meager experimental evidence, and on 
the basis of subjective appraisal of the types 
of items which would work best in the situa- 
tions where it is expected the inventory will 
be used” (1, pp. 5-6). The items in the 
Minnesota Vocational Interest Inventory were 
not grouped according to any prearranged 
plan, although efforts were made to maintain 
a comparable level of complexity for all three 
of the items in any one triad. Lack of 


1 This study was a portion of a dissertation sub- 
mitted to the faculty of the University of Minnesota 
in partial fulfillment of requirements for the degree 
of Doctor of Philosophy. The author wishes to ex- 
press his appreciation to his advisors, Professors 
D. G. Paterson and K. E. Clark, for their valuable 
assistance in all phases of the research. 


planned grouping may produce triads which 
require difficult decisions from subjects. A 
subject may not like any of the items, or he 
may like all three equally well. Clark re- 
ports (1) that this problem produced the 
most comment from subjects who cooperated 
in the development of the Vocational Interest 
Inventory. Such a situation, however, pre- 
sents no problem as far as developing an 
interest inventory is concerned. Triads of 
items of seemingly equal attractiveness may 
actually be delicate and subtle discriminators 
for certain occupational groups, in which 
case they are very useful; or responses may 
be about equally divided between the three 
items for all occupational groups, in which 
case they are simply not scored. In the latter 
case they are useless but not troublesome. 
With regard to administering an interest in- 
ventory, however, it would appear wise to 
avoid the resistance and dissatisfaction which 
such difficult choices might produce in sub- 
jects taking the inventory, unless such choices 
are particularly useful in discriminating 
among the interests of defined vocational 
groups. 

Kuder (6, 7) reports empirical studies 
which indicate that preference items do satisfy 
certain objective criteria for interest items. 
Cronbach (3, 4) has summarized evidence 
indicating that response sets exist in conven- 
tional objective tests; and he recommends the 
use of forced-choice, multiple-choice, and 
paired-comparison items rather than true- 
false, L-I-D, Yes-?-No, or check-list forms.

When response sets are present, it is pos- 
sible to capitalize on them by using empiri- 
cal weighting procedures such as Strong’s. 
Strong does not mention type of item in dis- 
cussing the development of the Vocational In- 
terest Blank. In Vocational Interests of Men 
and Women (11) the present writer could find 
no indication of the basis for selection of the 
L-I-D form of item, nor of whether any other 
type of item was considered for most sections
of the Blank. About three-fourths of
the Vocational Interest Blank is made up of 
L-I-D type items, and the remainder con- 
sists of three types. For one type the sub- 
ject chooses that one of three statements 
which most applies to him. The three state- 
ments are usually on more or less of a con- 
tinuum. For another type the subject chooses 
between two items, but a response of “indif- 
ferent” is permitted. Strong states (11, p. 
670) that he prepared this part of the blank 
to “restrict the area to which the interest re- 
sponse is to be made.” The paired items are 
always closely related in some way. Strong’s 
opinion was that the objective was not en- 
tirely accomplished because of the prevalence 
of the “indifferent” response. The third type 
of item asks the subject to divide ten items 
into three most liked (or most important), 
three least liked (or least important), and 
four in between. This is actually a forced-choice
type of item, though considerably different
from that used by Clark and Kuder.
It is interesting, however, that in a number of 
studies of interest-item stability which Strong 
(11) reports, this section of forced-choice 
items seemed to be the most consistently inferior
to the other sections.

Zuckerman (13) carried out a study in 
which forced-choice and L-I-D forms of the 
same interest items were compared. He 
found no appreciable difference between the 
two forms of items in odd-even reliability or 
in differentiation between groups. Since more 
weights were possible from the L-I-D form 
and since the L-I-D inventory was shorter 
and required about half as much time to 
complete as the forced-choice inventory, 
Zuckerman concludes that L-I-D test item 
arrangement is clearly superior to the forced- 
choice form. He points out the need for fur- 
ther investigation, however, in the light of 
certain limitations in his design. 

The first limitation was that the construc- 
tion of inventories and the samples of sub- 
jects used were such that the effects of re- 
sponse sets may have been less than usual. 
Another limitation was the arbitrary method 
of grouping items into clusters of four from 
which the forced-choice pairs were con- 
structed. 
The third limitation was the use of pairs
in the forced-choice form. This limitation is 
particularly significant in view of the em- 
phasis Zuckerman puts on the number of 
weights available from each type of item. 
A single L-I-D item may yield three scoring 
weights, whereas a forced-choice pair yields 
only one weight for two items. Thus the 
L-I-D form may yield six times as many 
scoring weights per item as the forced-choice 
pairs. If triads are used in the forced-choice 
form, however, as in the Minnesota Voca- 
tional Interest Inventory and in the Kuder 
Preference Record, six weights may be ob- 
tained from each triad compared with the nine 
which would be possible from the same items 
in L-I-D form. That the possible number of 
scoring weights, however, is a less important 
indication of an item’s value than Zuckerman 
seems to feel is shown by his own results. He 
obtained equivalent discrimination with both 
forms of the inventory he used, although the 
112-item L-I-D form had a maximum of 336 
possible weights while the 168-pair forced- 
choice form had a maximum of only 168 
possible weights. Apparently, therefore, the 
number of scoring weights available from an 
item is a poor predictor of its empirically de- 
termined usefulness. 
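
The weight counts in this paragraph follow from simple arithmetic, which can be checked with a minimal Python sketch. The constant and function names are illustrative and do not come from the original study; the counting convention (three weights per L-I-D item, one weight per forced-choice pair, six per triad) is the one stated in the text.

```python
# Scoring-weight counts under the conventions stated in the text.
# All names are illustrative; only the numbers come from the article.

LID_WEIGHTS_PER_ITEM = 3       # Like, Indifferent, Dislike
PAIR_WEIGHTS_PER_PAIR = 1      # one weight for the two items of a pair
TRIAD_WEIGHTS_PER_TRIAD = 6    # as stated for MVII / Kuder triads


def weights_per_item(weights_per_unit: int, items_per_unit: int) -> float:
    """Average number of scoring weights contributed by a single item."""
    return weights_per_unit / items_per_unit


lid_per_item = weights_per_item(LID_WEIGHTS_PER_ITEM, 1)
pair_per_item = weights_per_item(PAIR_WEIGHTS_PER_PAIR, 2)
triad_per_item = weights_per_item(TRIAD_WEIGHTS_PER_TRIAD, 3)

# L-I-D items yield six times as many weights per item as paired items.
lid_to_pair_ratio = lid_per_item / pair_per_item

# Zuckerman's two forms: 112 L-I-D items versus 168 forced-choice pairs.
lid_total = 112 * LID_WEIGHTS_PER_ITEM
pair_total = 168 * PAIR_WEIGHTS_PER_PAIR

# The three items of one triad, presented in L-I-D form instead.
lid_for_triad_items = 3 * LID_WEIGHTS_PER_ITEM

print(lid_to_pair_ratio, lid_total, pair_total, lid_for_triad_items)
```

On this count a triad recovers two weights per item, which narrows the six-to-one gap between L-I-D items and pairs to nine against six.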

In a later article (14), stimulated by pri- 
vate correspondence with L. J. Cronbach, 
Zuckerman points out an aspect of his re- 
sults which is related to the above discussion 
of scoring weights. Since his L-I-D form 
of inventory provided many more possible 
weights than his forced-choice form, there 
were more possibilities for weights to be 
based on chance differences on the L-I-D 
form. On an a priori basis, therefore, it 
would be expected that the validity of the 
L-I-D keys would shrink more in cross-vali- 
dation than the validity of the forced-choice 
keys. 

A very real limitation in his design that 
Zuckerman does not point out was the use of 
every pair which could be formed from each 
group of four items. This procedure resulted 
in the use of each item three times in the 
forced-choice inventory. There is nothing 
particularly objectionable in this procedure, 
but Zuckerman puts so much weight on the 
greater length and longer time required by
the forced-choice form that it should be 
pointed out that this greater length was a 
feature of the particular design of his forced- 
choice inventory and not of forced-choice 
inventories in general. The Minnesota Voca- 
tional Interest Inventory and the Kuder Pref- 
erence Record do not generally repeat the 
same item in different triads, so Zuckerman’s 
procedure is not similar in this respect to 
that which is in actual use in interest meas- 
urement at present. In this connection it is 
important to note that the L-I-D form was 
about half as long and took about half as 
much time as the forced-choice form, but it 
had only one-third as many items. From 
these results it would be predicted that a 
forced-choice form without repeated items 
would require less time than an L-I-D form 
using the same items. 
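
The effect of forming every possible pair within each group of four items can be illustrated with a short Python sketch. The item labels are placeholders (integers), and the assumption that the 112 items were split into 28 consecutive groups of four is made only for illustration; the text states just that the groups contained four items each.

```python
from collections import Counter
from itertools import combinations

# Placeholder item labels; the real item content is not reproduced here.
items = list(range(112))

# Assumed grouping: 28 consecutive clusters of four items.
groups = [items[i:i + 4] for i in range(0, len(items), 4)]

# Every pair that can be formed inside each group of four.
pairs = [pair for group in groups for pair in combinations(group, 2)]

# How often each item appears across all pairs.
uses = Counter(item for pair in pairs for item in pair)

n_pairs = len(pairs)
item_presentations = sum(uses.values())

print(n_pairs, set(uses.values()), item_presentations)
```

Each group of four produces six pairs, so every item is shown three times and the forced-choice form presents three times as many item statements as the L-I-D form built from the same items.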

The present study was designed to investi- 
gate some of the questions left unanswered 
as a result of the limitations of Zuckerman’s 
study. 


Procedure 


Interest inventories. The Minnesota Vocational 
Interest Inventory (MVII) was used as the basis 
for a forced-choice inventory and an L-I-D inven- 
tory containing the same items. Of the 190 forced- 
choice triads in the MVII, 138 were selected on the 
basis of their contribution to existing Yeoman and 
Shipping-Stock Clerk interest keys. The number of 
triads was reduced from 190 to 138 in order to re- 
duce the task for subjects, who were required to 
complete three interest inventories. The particular 
basis used for selection was chosen so that a study 
of the aforementioned keys could be made (9). 

The 138 selected triads were arranged in an inven- 
tory called the Vocational Interest Blank, Form F 
(forced-choice). The inventory was mimeographed 
and contained seven pages. An L-I-D inventory 
was constructed from the same 414 items making 
up the 138 triads in Form F. This second inventory 
was called the Vocational Interest Blank, Form L 
(L-I-D). It was mimeographed and contained ten 
pages. 

The Strong Vocational Interest Blank also was 
administered to all subjects, but responses to it were 
not used for the present study. 

Subjects and administration. The two experimen- 
tal inventories and the SVIB were administered to 
178 students in introductory psychology classes at 
the University of Minnesota. Some inventories were 
given to students to take home and return. Others 
were group administered. Complete sets of usable 
inventories were obtained from 167 students. The 
students were asked to indicate the time they began
and the time they finished each inventory. The re- 
sponses for the group-administered inventories were 
used to provide an indication of the length of time 
required to complete each inventory. 

Form L, the SVIB, and the MVII were group
administered to 167 Navy yeomen,2 who are clerical
workers in the Navy. There were 135 sets of usable 
returns. The yeomen were asked to indicate whether 
or not they would choose the same career if starting 
over again at age 18. Of the 135, 85, or 63%, 
marked “yes” while 50, or 37%, marked “no.” 
These responses were used as a crude indication of 
occupational satisfaction and dissatisfaction, respec- 
tively. 

Treatment of data. The students were randomly 
assigned to two groups with N’s of 100 and 67. The 
larger was called the criterion group and the smaller 
the validation group. Experimental interest keys 
were then constructed to differentiate yeomen from 
students on the basis of differences in percentages of 
response to items by the yeoman and criterion stu- 
dent groups. Two keys, a multiple-weight key and 
a unit-weight key, were constructed for both the 
forced-choice and the L-I-D inventories. Weights 
of from minus four to plus four were assigned to 
responses for the multiple-weight keys according to 
Strong’s weighting table. For the unit-weight keys 
weights of minus one or plus one were assigned to 
responses with percentage differences of 20 or greater, 
on the basis of results of Clark and Gee’s (2) studies 
of methods of selecting items for interest inventories. 
Positive weights were assigned when the yeomen’s 
percentage of response was greater than the students’ 
percentage. High scores on a key, therefore,
indicate “yeoman interests,” and low scores indicate
“student interests.” 
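
The unit-weight rule described above can be sketched in Python as follows. The response labels and percentages are hypothetical and serve only to show the mechanics; the 20-point threshold and the sign convention (positive when the yeomen's percentage is higher) are taken from the text. The multiple-weight keys depend on Strong's weighting table, which is not reproduced here and is therefore not sketched.

```python
from typing import Dict, Iterable


def unit_weight(pct_yeomen: float, pct_students: float,
                threshold: float = 20.0) -> int:
    """Return +1, -1, or 0 for one response, per the unit-weight rule."""
    diff = pct_yeomen - pct_students
    if abs(diff) < threshold:
        return 0                      # difference too small to be keyed
    return 1 if diff > 0 else -1      # positive when yeomen respond more often


def build_key(pct_yeomen: Dict[str, float],
              pct_students: Dict[str, float]) -> Dict[str, int]:
    """Keep only the responses that receive a nonzero weight."""
    key = {}
    for response, p_yeomen in pct_yeomen.items():
        weight = unit_weight(p_yeomen, pct_students[response])
        if weight != 0:
            key[response] = weight
    return key


def score(marked: Iterable[str], key: Dict[str, int]) -> int:
    """Sum the weights of the responses a subject marked."""
    return sum(key.get(response, 0) for response in marked)


# Hypothetical response percentages, for illustration only.
pct_yeomen = {"typing:like": 62.0, "typing:dislike": 15.0, "hiking:like": 40.0}
pct_students = {"typing:like": 30.0, "typing:dislike": 45.0, "hiking:like": 48.0}

key = build_key(pct_yeomen, pct_students)
example_score = score(["typing:like", "hiking:like"], key)
print(key, example_score)
```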

All subjects were scored on each of the four keys, 
and means and standard deviations on each key 
were calculated for the following groups: yeomen, 
satisfied yeomen, dissatisfied yeomen, criterion stu- 
dents, and validation students. Mean differences and 
percentages of overlapping were then obtained for 
the following pairs of groups on each key: yeomen 
vs. criterion students (Y-CS), yeomen vs. validation 
students (Y-VS), criterion students vs. validation 
students (CS-VS), satisfied yeomen vs. criterion stu- 
dents (SY-CS), satisfied yeomen vs. validation stu- 
dents (SY-VS), satisfied yeomen vs. dissatisfied yeo- 
men (SY-DY). The Y-CS and Y-VS pairs provided 
measures of validities and cross-validities of the 
keys in differentiating between the groups for which 
the keys were built. An indication of validity 
shrinkage was obtained from the CS-VS pair of 
groups. It had been found that satisfied yeomen
scored higher than both yeomen in general and
dissatisfied yeomen on existing clerical keys (9), so the
SY-CS, SY-VS, and SY-DY pairs were included as
measures of additional validities of the
experimental keys.

2 The samples of yeomen were obtained under the
auspices of the Chief of Naval Personnel and through
the cooperation of the Commanding Officers, U. S.
Naval Training Centers, Norfolk, Virginia, and San
Diego, California. The cooperation of both these
officers and the yeomen is sincerely appreciated.

Percentages of overlapping were calculated accord- 
ing to the method given by Tilton (12) and also by 
actually tabulating the frequencies of overlapping 
scores. Overlapping scores are those which are above 
the point of intersection of the two distributions for 
the group with the lower mean and below the point 
of intersection for the group with the higher mean. 
The frequency method was used because it was 
necessary to know just which scores were overlap- 
ping and which were not overlapping on the differ- 
ent keys in order to calculate the standard error for 
the differences in percentages. The standard error 
formula used is given by McNemar (8) as formula 
28a. 
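
Both overlap measures can be sketched in a few lines of Python. Two assumptions are made here and are not confirmed by the text: first, that Tilton's tabled percentage of overlapping corresponds to the normal-curve expression 2Φ(-d/2), where d is the mean difference divided by the average of the two standard deviations; second, that the point of intersection used in the frequency method is supplied by the analyst (the `cut` argument), since the text does not say how it was located. The example values for the Tilton function are the FU-key means and standard deviations reported in Table 1.

```python
from statistics import NormalDist
from typing import Sequence


def tilton_overlap(mean_a: float, sd_a: float,
                   mean_b: float, sd_b: float) -> float:
    """Percentage of overlapping, assuming normal curves with a common SD
    equal to the average of the two observed SDs (an assumed reading of
    Tilton's method, not a transcription of his table)."""
    d = abs(mean_a - mean_b) / ((sd_a + sd_b) / 2.0)
    return 200.0 * NormalDist().cdf(-d / 2.0)


def frequency_overlap(low_group: Sequence[float],
                      high_group: Sequence[float],
                      cut: float) -> float:
    """Percentage of scores on the 'wrong' side of the cut point: above it
    for the lower-mean group, below it for the higher-mean group."""
    overlapping = (sum(s > cut for s in low_group)
                   + sum(s < cut for s in high_group))
    return 100.0 * overlapping / (len(low_group) + len(high_group))


# FU key, yeomen versus criterion students (values from Table 1).
fu_y_cs = tilton_overlap(18.8, 18.3, -20.6, 17.6)

# Toy score lists, purely illustrative.
toy = frequency_overlap([1, 2, 3, 6], [4, 7, 8, 9], cut=5)

print(round(fu_y_cs, 1), toy)
```

Under these assumptions the FU value falls within a few tenths of the 27.3% reported in Table 2, which suggests the approximation is close to the procedure actually used.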

The forced-choice unit-weight key (FU) was com- 
pared with the L-I-D unit-weight key (LU), and 
the forced-choice multiple-weight key (FM) was 
compared with the L-I-D multiple-weight key (LM). 
Comparisons were made in terms of differences in 
amount of frequency overlapping provided by each 
key for all six pairs of groups. 


Results 


The mean time required for 38 students to
complete Form L was 19.4 minutes. The
standard deviation was 4.6. The mean time
required by the same students to complete
Form F was 20.6 minutes. The standard
deviation was 4.5. The difference in means
is not statistically significant (.4 > p > .3)
and is certainly of no practical significance.

The Form L unit-weight key consisted of 
101 weights on 75 items, while the Form F 
unit-weight key consisted of 149 weights on 
106 items. There were 720 weights on 339 
items in the Form L multiple-weight key and 
559 weights on 347 items in the Form F
multiple-weight key. 

The means and standard deviations of the 
yeoman and student groups on each key are 
given in Table 1. Table 2 shows that each 
key differentiates well between the yeoman 
and student groups, with a rather small 
amount of overlapping. Little validity shrink- 
age of any key occurred, as shown by the lack 
of increase in amount of overlapping from the 
Y-CS differentiation to the Y-VS differentia- 
tion. Satisfactory cross-validity of each key 
is also shown by high overlapping and lack of 
differentiation of the two student groups. 
Each key provided considerably more differ- 
entiation between satisfied yeomen and stu- 
dents than between yeomen in general and 
students. Each key also adequately differ- 
entiated between satisfied yeomen and dis- 
satisfied yeomen, with about 60% overlap- 
ping of the two groups. 
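
The critical ratios in Table 2 can be approximated from the means and standard deviations in Table 1. The sketch below assumes the usual large-sample formula for independent groups, CR = (M1 - M2) / sqrt(SD1²/N1 + SD2²/N2); the article does not state the formula, so this is an inference. The group sizes (135 yeomen, 100 criterion students) are those given under Procedure.

```python
from math import sqrt


def critical_ratio(mean_a: float, sd_a: float, n_a: int,
                   mean_b: float, sd_b: float, n_b: int) -> float:
    """Mean difference divided by its standard error (independent groups)."""
    standard_error = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error


# FU key: yeomen (N = 135) versus criterion students (N = 100).
cr_fu_y_cs = critical_ratio(18.8, 18.3, 135, -20.6, 17.6, 100)
print(round(cr_fu_y_cs, 2))
```

The result agrees with the tabled value of 16.67 to within rounding of the published means.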

Forced-choice and L-I-D keys are com- 
pared in terms of amount of overlapping of 
groups in Table 3. Ten of the 12 compari- 
sons are in terms of differentiation of groups 
which should be different. These differentia- 
tions represent various kinds of validity of 
the keys. In the ten comparisons, one differ- 
ence, statistically significant, favors an L-I-D 
key. There are no differences in two cases, 
and the other seven differences favor forced- 
choice keys. All seven are statistically sig- 
nificant. The average difference in frequency 
overlapping in favor of forced-choice keys is 
5.9%. 
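
The tally and the 5.9% average can be reproduced from the ten validity comparisons in Table 3 (the two CS-VS comparisons are excluded because they concern validity shrinkage). The dictionary names below are illustrative; the values are the tabled differences in frequency overlapping, with positive values favoring forced-choice keys.

```python
# Differences in frequency overlapping from Table 3, CS-VS rows omitted.
fu_vs_lu = {"Y-CS": 12.8, "Y-VS": 9.8, "SY-CS": 6.5, "SY-VS": 17.1, "SY-DY": -8.9}
fm_vs_lm = {"Y-CS": 0.0, "Y-VS": 10.8, "SY-CS": 5.4, "SY-VS": 5.3, "SY-DY": 0.0}

diffs = list(fu_vs_lu.values()) + list(fm_vs_lm.values())

favor_forced_choice = sum(d > 0 for d in diffs)
no_difference = sum(d == 0 for d in diffs)
favor_lid = sum(d < 0 for d in diffs)

# Average over all ten comparisons, negative and zero entries included.
mean_diff = sum(diffs) / len(diffs)

print(favor_forced_choice, no_difference, favor_lid, round(mean_diff, 1))
```

The average is taken over all ten comparisons, so the single difference favoring an L-I-D key and the two ties are already reflected in the 5.9% figure.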

The differences in Tilton overlapping parallel
those in frequency overlapping. The pairs
which show no differences in the latter,
however, show slight differences in the former in
favor of forced-choice keys. The average
difference in Tilton overlapping in favor of
forced-choice keys is 4.1%.


Table 1

Means and Standard Deviations of Yeoman and Student Groups on Four Experimental
Vocational Interest Keys

                            FU             FM              LU              LM
Group                  Mean     SD    Mean     SD     Mean     SD     Mean     SD
Yeomen                 18.8   18.3    [not legible]   11.4   15.1     54.1   80.4
Satisfied yeomen       25.4   15.0    [not legible]   16.8   12.4     80.7   70.6
Dissatisfied yeomen     7.7   18.5    [not legible]    1.6   14.0     11.6   78.9
Criterion students    -20.6   17.6    [not legible]  -16.1   12.8    -98.1   69.4
Validation students   -22.2   17.3    [not legible]  -15.4   12.7    -85.4   70.2


Table 2

Mean Differences, Critical Ratios, and Percentages of Overlapping of Pairs of
Yeoman and Student Groups on Four Experimental Vocational Interest Keys

       Pair of     Mean                       % Tilton      % Frequency
Key    Groups      Difference    CR           Overlapping   Overlapping
FU     Y-CS          39.5        16.67*          27.3          26.4
       Y-VS          41.0        15.56*          25.0          25.8
       CS-VS         -1.5**       0.56           96.5          97.0
       SY-CS         46.0        19.20*          15.8          18.4
       SY-VS         47.5        17.86*          14.0          13.2
       SY-DY         17.6        [illegible]     59.9          66.7

FM     Y-CS         166.6        15.79*          29.6          30.6
       Y-VS         170.5        14.74*          27.3          24.8
       CS-VS         -3.9**       0.31           98.1          93.4
       SY-CS        195.9        18.01*          18.2          19.5
       SY-VS        199.8        16.84*          16.5          21.1
       SY-DY         73.6         5.65*          60.7          63.7

LU     Y-CS          27.4        15.07*          32.5          39.2
       Y-VS          26.8        13.29*          33.4          35.6
       CS-VS          0.6         0.03           98.2          95.8
       SY-CS         32.8        17.68*          19.3          24.9
       SY-VS         32.2        15.69*          19.9          30.3
       SY-DY         15.2         6.33*          56.6          57.8

LM     Y-CS         152.2        15.52*          31.0          30.6
       Y-VS         139.5        12.66*          35.4          35.6
       CS-VS         12.7         1.15           92.7          98.2
       SY-CS        178.8        17.29*          20.2          24.9
       SY-VS        166.1        14.44*          23.8          26.3
       SY-DY         69.1         5.14*          64.5          63.7

* Significant at the .01 level.
** The minus sign indicates that validation students average
lower than criterion students.


The two comparisons involving differentia- 
tion of the two student groups from each 
other are in terms of discrimination between 
groups which should not be different. These 
discriminations represent validity shrinkage 
of the keys. One of the differences in the 
comparisons favors a forced-choice key and 
one favors an L-I-D key, but neither is sta- 
tistically significant. 


The 12 comparisons are, of course, not all 
independent since the same groups are used 
repeatedly. They do, however, represent dif- 
ferent aspects of the keys’ performances, and 
the results are fairly consistent. 


Discussion 


The results of the measurement of time re- 
quired to complete Form L and Form F 
do not support the hypothesis suggested by 
Zuckerman’s results that a forced-choice in- 
ventory would require less time than an L-I-D 
inventory of the same length. They do indi- 
cate, however, that Zuckerman’s conclusion 
that L-I-D inventories are more efficient than 
forced-choice inventories does not apply when 
the inventories are composed of the same 
number of items. 

The larger number of items scored and of
scoring weights on the unit-weight forced-choice
key than on the unit-weight L-I-D key
suggests that forced-choice items provide more
large percentage differences than do L-I-D
items. If intercorrelations of items are about
the same for the two forms, more differentiation
of criterion and reference groups could
be expected from forced-choice keys. The
difference in number of weights on the two
multiple-weight keys shows the effect, when
small percentage differences are weighted, of
the restriction in number of weights available
per item on the forced-choice form. The better
results obtained with the forced-choice
key, however, indicate that the restriction is
of little importance.


Table 3

Differences in Amount of Overlapping† Provided by Experimental Forced-Choice
and L-I-D Vocational Interest Keys

Keys          Pair of     Difference in
Compared      Groups      Overlapping       σD
FU vs. LU     Y-CS           12.8           4.88
              Y-VS            9.8           4.64
              CS-VS           1.2          13.70
              SY-CS           6.5           2.42
              SY-VS          17.1           2.71
              SY-DY          -8.9‡          3.92

FM vs. LM     Y-CS            0.0           4.00
              Y-VS           10.8           4.32
              CS-VS          -4.8‡         13.96
              SY-CS           5.4           2.09
              SY-VS           5.3           2.28
              SY-DY           0.0           3.47

* Significant at the .05 level.
** Significant at the .01 level.
† Frequency overlapping is the measure considered.
‡ Negative differences favor L-I-D keys.


In doing the job for which they were de- 
signed, viz., differentiating yeomen from stu- 
dents, the forced-choice keys were quite con- 
sistently superior to the L-I-D keys. In the 
additional task of separating “occupationally 
satisfied” yeomen from students, this superi- 
ority was even more consistent. It did not 
extend, however, to the separation of satisfied 
and dissatisfied yeomen, for which an L-I-D 
key was superior in one case and neither key 
was superior in a second case. 

The finding of no difference between L-I-D 
and forced-choice keys in amount of overlap- 
ping of the two student groups or in increase 
in overlapping of yeomen and validation stu- 
dents over that of yeomen and criterion stu- 
dents does not support the hypothesis that 
validities of L-I-D keys will shrink more in 
cross-validation than validities of forced- 
choice keys. This was true even though the 
multiple-weight L-I-D key contained consid- 
erably more weights than the multiple-weight 
forced-choice key. 

The above results are based on only one 
vocational group, with a group of students 
rather than a men-in-general reference group 
with which to contrast responses of the voca- 
tional group. To the extent to which such 
results may be generalized, it may be said 
that forced-choice interest keys differentiate 
groups better than L-I-D interest keys. The 
average difference in percentage of overlap- 
ping, however, is relatively small. Any de- 
cision as to how great a difference in amount 
of overlapping must be in order to be of 
practical significance must probably be arbi- 
trary. It is the present writer’s preference to 
choose the type of item which appears to 
provide superior validity, regardless of the 
amount of superiority, if other characteristics 
do not outweigh a slight difference in validity. 
The present study suggests that the charac- 
teristics of “time required” and “weights 
available” for forced-choice and L-I-D items
do not outweigh the difference in validity. 
There may be other considerations, however, 
which would cause an inventory builder to 
prefer L-I-D items to forced-choice items de- 
spite the slight superiority of the latter in 
validity. Suspected differences in reliability 
or in resistance of subjects are possible ex- 
amples. 


Summary 


1. The present study was undertaken to de- 
termine whether forced-choice interest items 
give better results than L-I-D interest items. 

2. Results are based on two interest inven- 
tories using the same items, one in forced- 
choice form and one in L-I-D form. The in- 
ventories were administered to a group of 
Navy yeomen (clerical workers) and a group 
of college students. An indication was ob- 
tained of the occupational satisfaction of the 
yeomen. Unit-weight and multiple-weight 
keys were developed for each inventory to 
differentiate yeomen from students. Percent- 
ages of overlapping on each key were ob- 
tained of the yeoman and satisfied yeoman 
groups with criterion students (used in build- 
ing the keys) and validation students (not 
used in building the keys), of the two stu- 
dent groups with each other, and of satisfied 
and dissatisfied yeomen with each other. 

3. It was found that forced-choice keys 
were superior to L-I-D keys in separating 
groups in seven of ten comparisons. There 
was no difference in two cases, and an L-I-D 
key was superior in one comparison. The 
average superiority of forced-choice keys was 
a 5.9% decrease in overlapping. The unit- 
weight forced-choice key contained more scor- 
ing weights than the unit-weight L-I-D key, 
but for multiple-weight keys the situation was 
reversed. There was little difference in va- 
lidity shrinkage for the two kinds of items; 
all keys cross-validated satisfactorily. The 
difference in average time required to com- 
plete the two inventories was slight and not 
statistically significant. 

4. It was concluded that forced-choice in- 
terest items are superior to L-I-D items in 
differentiating groups. The difference is small 
enough, however, that other considerations 
may well play a part in selection of item form
for interest-inventory construction. 


Received July 14, 1954. 
Testing programs in the fields of music and
the arts represent relatively neglected areas of 
testing development. Efforts to develop a 
single instrument to measure accurately the 
complex of abilities called musical talent have 
been unsuccessful, although instruments have 
been devised that seem to measure some of 
the individual abilities which constitute this 
complex. Tests presently recognized as meas- 
ures of several facets of musical ability in- 
clude those developed by Seashore, and 
Kwalwasser and Dykema, among others. Re- 
views of existing tests of musical achievement 
may be found in a recent book by Lundin 
(4). 


The Test Concept 


While many of the abilities measured by 
current tests are important to good musician- 
ship, one ability—which seems to be basically 
involved in all fields of musical endeavor— 
has not been identified in previous test de- 
velopment. This ability is the power of audi- 
tory-visual discrimination. As the term im- 
plies, auditory-visual experience is the process 
of visualizing the notation of what is heard 
and of hearing with the inner ear what is seen 
in notation; it is utilized in the everyday ac- 
tivities of the practicing musician and is an 
important part of the basic musicianship 
training provided for music majors in col- 
lege. It is essential to musicians in every 
phase of musical experience, including
composition, performance, conducting, and
pedagogy. A test to measure the ability to
discriminate sharply between correlated sounds
and notation seems, therefore, to represent a
significant new addition to music achievement
testing.

1 All statistical work involved in the development
and refinement of the Aliferis Music Achievement
Test was performed by staff members of the Bureau
of Institutional Research, University of Minnesota,
Robert J. Keller, Director. The test development
was supported in part by grants in aid of research
from the Graduate School, University of Minnesota.

The Aliferis Music Achievement Test, Col- 
lege Entrance Level, makes possible specific 
evaluation and diagnosis of achievement in 
each of the three organizing forces of music— 
melody, harmony, and rhythm. It is the first 
test to measure this achievement in terms of 
auditory-visual discrimination. 

This paper provides the reader with the
highlights of the development of the test.
These include descriptions of test items and
discussions of problems encountered, the
nature of the test administrations, and the
results of test analyses and validation.
Further detail and additional information may
be obtained from an earlier publication (1)
or by reference to the test and test manual
published by the University of Minnesota
Press.


Musical Content and Psychological Controls 


A good music achievement test should in- 
corporate test items which are musically rec- 
ognizable to the musician, items which main- 
tain a high interest and motivation level. At 
the same time, other necessary characteristics 
of a good test cannot be sacrificed to attain 
musical relevance. For instance, the test 
must, as nearly as possible, measure only a 
single ability to a known extent. Test scores 
must not be influenced by the absence or 
presence of abilities not directly related to 
the one being measured, e.g., test scores 
should be unaffected by differences in intel- 
lectual level, type of instrument played, gen- 
eral background, etc. The problem of bal- 
ancing musical content with desirable test 
controls was the first hurdle in the develop- 
ment of the Aliferis test. 
Test Controls


Each item of the test was designed to in- 
volve one, and only one, critical problem 
regarding which the subject would make a 
decision. The application of this principle 
throughout a section of homogeneous test 
material would enable study of the subject’s 
reactions to a musical subject area. The 
test results could then be used for (a) com- 
parison of achievement in given subject areas, 
(b) accurate diagnosis from specific items in
a section of the test, and (c) prediction based 
on the combination of the test results with 
other measurement information. 

To combine psychological test controls and 
musical content, the organizing forces of 
music—melody, harmony, and rhythm—were 
divided into two parts, elements and idioms. 
The element became the smallest possible 
musical unit which could be recognized out 
of context. Judgment was required on only 


one musical problem—recognition of the ele- 
ment—by eliminating all secondary factors. 
The idiom presented each of the elements in 
a simple context created by enlarging the 
configuration of the one problem contained 


in the element. In this way the idiom repre- 
sented a test item which approached the art 
form and brought the test closer to everyday 
musical experience without changing the con- 
trols established in the element. The element 
might be likened to a word in language; the 
idiom might be compared to a phrase, or the 
context of the word. 


Test Instrument 


The ability to be measured had been de- 
fined, and the nature of the test items had 
been determined. The next problem involved 
the means of presentation of the auditory 
stimuli. Instruments considered included the 
piano, electric organ, pipe organ, harmonium, 
violin, clarinet, and French horn. Use of the 
piano was seriously considered, particularly 
in view of its suitability for presentation of 
the harmony section of the test. The instru- 
ment best suited for the harmony section 
would be one capable of sustaining all four 
voices of a chord with equal intensity as well 
as uniform quality throughout a period of 
several beats. Laboratory reports showed, 
however, that piano tone fluctuated widely,
in quality and intensity, and rapidly became 
inaudible. Furthermore, musicians have been 
found, in piano performance, unconsciously to 
favor the soprano voice of one chord or the 
bass voice of another. In an attempt to 
avoid these mechanical and human short- 
comings, experiments were made using a pipe 
organ, an electric organ, and a small har- 
monium. Each of these instruments proved 
unacceptable, however. The organ was re- 
jected because its tone in the test situation 
was sufficiently unfamiliar to the average en- 
tering freshman that it distracted seriously 
from the test problem. Attempts to adopt 
the violin, clarinet, or French horn were un- 
successful because no one instrument could 
play through the pitch range of a complete 
test section. The use of several instruments 
in one section produced distraction because 
of changes in timbre. After considerable ex- 
perimentation, the piano was finally chosen 
as a compromise. Its inadequacies, both 
musical and psychological, were thought to 
be compensated for by its recognition as the 
representative instrument of the complete 
tonal realm, and by its widespread use by 
musicians. 


Other Test Characteristics 


Another problem required a decision as to 
whether to present one auditory stimulus to 
be identified from multiple visual images or 
several auditory stimuli to be compared with 
a single visual image. At the college entrance 
level, the testing situation which would pre- 
sent for recognition only one auditory stimu- 
lus and which, consequently, would not em- 
phasize retention or memory, seemed most 
appropriate. Multiple-choice answers in mu- 
sical notation were adopted to enable the 
untutored, incoming freshman to select an 
answer without restrictive reference to unfa- 
miliar nomenclature and terminology. 

Additional test characteristics were: limi- 
tation of the pitch range of each item to that 
pitch region which is known to present least 
difficulty for hearing, use of all single acci- 
dentals—natural, sharp, and flat—but no 
double accidentals in the test sections, and 
restriction to treble and bass staves. 
The Test Items
Melodic Elements 


The test of melodic elements uses all the
melodic intervals from minor second through
the octave. The intervals are played as
shown in Fig. 1. In the test booklet the
student sees four intervals (as in Fig. 2) from
which he selects A, B, C, or D as the correct
answer.

Fig. 1. Melodic element: auditory stimulus (Item 1, MM 72-60).

Fig. 2. Melodic element: visual stimulus (choices A, B, C, D).

Figures 1 and 2, typical of items in
this section, show that:


1. Rhythm and harmony are not present. 

2. Duration, speed of presentation, and 
time space between items are controlled. 

3. The only judgment involved is the rec- 
ognition of the smallest melodic entity—a 
melodic interval as defined by two consecu- 
tive tones. 

4. The student does not have to name the 
interval (major 3rd, perfect 5th, etc.); he 
merely indicates his choice by marking A, B, 
C, or D as the answer. 

5. In this item, the first note the subject 
hears is B and the first note of each of the 
printed intervals is also B; this creates a 
“point of contact” which eliminates the need 
for judgment involving absolute pitch. 

Some evolutionary aspects of the melodic 
elements test items deserve discussion. Since 
nothing was known concerning the effect of 
interval direction upon difficulty of
recognition, experiments in this area were
necessary. In the first test edition, each interval
was given four times, twice ascending and
twice descending. Then, employing techniques
of item analysis, item-discrimination power 
was compared with direction of interval, the 
more discriminating direction of each interval 
being selected for inclusion in the next test 
edition. All intervals were given in the same 
pitch range to increase the validity of the 
procedure—G below middle C to D above 
high C. The effect of direction on recogni- 
tion was found to be as follows: intervals 
which were more discriminating when ascend- 
ing included the diminished fifth-augmented 
fourth, perfect fourth, major sixth, major and 
minor seventh, major third, major and minor 
second; intervals which were more discrimi- 
nating when descending included the minor 
sixth, octave, minor third, and perfect fifth. 
Even within these directional groups, the in- 
tervals varied greatly in their discriminative 
power, however. 
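
The selection step described above can be sketched in Python. The article does not say which item-analysis statistic was used, so the upper-lower discrimination index below is an assumption, and the counts are hypothetical placeholders rather than data from the test development.

```python
from typing import Dict, Tuple


def discrimination_index(upper_correct: int, lower_correct: int,
                         group_size: int) -> float:
    """Upper-lower index: proportion correct among high scorers minus the
    proportion correct among low scorers (an assumed statistic)."""
    return (upper_correct - lower_correct) / group_size


def more_discriminating(ascending: float, descending: float) -> str:
    """Name the direction with the larger index; ties default to ascending."""
    return "ascending" if ascending >= descending else "descending"


GROUP_SIZE = 50  # hypothetical number of students in each extreme group

# Hypothetical (upper_correct, lower_correct) counts per direction.
counts: Dict[str, Dict[str, Tuple[int, int]]] = {
    "perfect fourth": {"ascending": (44, 20), "descending": (40, 28)},
    "minor sixth": {"ascending": (35, 27), "descending": (41, 18)},
}

selected = {}
for interval, by_direction in counts.items():
    indices = {
        direction: discrimination_index(upper, lower, GROUP_SIZE)
        for direction, (upper, lower) in by_direction.items()
    }
    selected[interval] = more_discriminating(indices["ascending"],
                                             indices["descending"])

print(selected)
```

With these placeholder counts the sketch happens to agree with the directions reported in the text for the perfect fourth and the minor sixth, but the counts were chosen for that purpose and carry no evidential weight.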

Experimentation showed that extension of 
the pitch range (up or down) of the poorly 
discriminating intervals increased their pow- 
ers of discrimination. Thus the seconds, 
which were nondiscriminating on the first test 
form, were gradually lowered or raised in 
pitch until they showed acceptable positive 
discriminating power. In this manner each 
interval was set at the place in the pitch 
range which would maximize its power of 
discrimination. The unison finally had to be 
excluded as it remained nondiscriminating 
even after the pitch range had been extended 
to the limits of the keyboard. 

Studies of distracters provided some inter- 
esting results. Early in the test development, 
distracters were used which employed inter- 
vals that went in the opposite direction to 
the interval which was the correct answer, 
and intervals that remained stationary (see 
Fig. 3). However, whenever these distracters
were used in test items, the items proved to
be nondiscriminating. Apparently the students
could recognize the right direction of an
interval and whether or not it was stationary
even though they could not identify the
correct interval. As a result, all distracters
now in use provide intervals rising or falling
in the same direction as the correct response.

Fig. 3. Experimental forms of melodic elements.

Fig. 4. (a) Stepwise movement; (b) skipwise movement.


Melodic Idioms


Melodic idioms were constructed by placing
the melodic element (critical interval) in
a simple context which preserved all the
controls of the element. In the melodic idiom,
four factors were found to be in constant flux:
stepwise (Fig. 4a) versus skipwise (Fig. 4b)
movement, and ascending versus descending
movement. To organize and control the
varying effects of these forces, several
procedures were evolved. If the critical interval
was stepwise, it was approached skipwise
from the opposite direction, and if the
critical interval was skipwise, it was approached
stepwise from the opposite direction.

Fig. 5. Melodic idiom: auditory stimulus.

Fig. 6. Melodic idiom: visual stimulus.

Fig. 8. Harmonic element: visual stimulus.

The melodic idiom is presented as shown 
in Fig. 5. In the test booklet the student 
makes his selection from one of four alterna- 
tives (as shown in Fig. 6) by indicating his 
choice of A, B, C, or D as the correct an- 
swer. In this section the point of contact 
consists of the first three notes, including the 
first note of the critical interval. The an- 
swer, being the second note of the critical in- 
terval and the fourth note of the group, does 
not appear among the first three notes which 
are given. 
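
The construction rule and the point of contact described above can be sketched in a few lines. This is only an illustration: the source gives no interval sizes for the approach, so the constants below (a step counted as two semitones or fewer, an approach skip of a major third, an approach step of a major second) and the function names are assumptions introduced here, not part of the Aliferis procedure.

```python
# Intervals are signed semitone counts: positive = ascending, negative = descending.
# All three constants are assumptions made for this sketch.
STEP_MAX = 2        # a "step" is taken to be a minor or major second
APPROACH_STEP = 2   # hypothetical stepwise approach: a major second
APPROACH_SKIP = 4   # hypothetical skipwise approach: a major third


def approach_interval(critical):
    """Signed interval that leads into the critical interval.

    A stepwise critical interval is approached by skip, a skipwise one
    by step, and always from the opposite direction.
    """
    if critical == 0:
        raise ValueError("the unison was excluded from the test")
    size = APPROACH_SKIP if abs(critical) <= STEP_MAX else APPROACH_STEP
    direction = -1 if critical > 0 else 1
    return direction * size


def idiom_tail(start, critical):
    """Last three pitches of the idiom as MIDI-style note numbers.

    The first two are part of the point of contact; the third is the
    answer, which the student does not see among the given notes.
    """
    second = start + approach_interval(critical)
    return [start, second, second + critical]
```

For an ascending major second as the critical interval, `idiom_tail(60, 2)` falls a third and then rises a second; how the opening note of the four-note group is chosen is not stated in the text and is left out of the sketch.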


Harmonic Elements 


Harmonic elements employ four-voiced
arrangements of major and minor chords which
are used in all positions of the soprano (1, 3,
and 5) and bass (root position, first and
second inversions), and the first inversion of the
diminished triad. Figure 7 shows the chord
heard by the student. He makes his selection
from the four chords shown in Fig. 8, marking
A, B, C, or D as the notated chord which
corresponds to the one played.

Fig. 7. Harmonic element: auditory stimulus.

Fig. 9. Harmonic idiom: auditory stimulus.

Fig. 10. Harmonic idiom: visual stimulus.

Additional controls were incorporated in these
test items as follows:


1. The soprano tone of the chord that is 
played and the soprano note of each of the 
written chords is the same, thereby creating 
the point of contact. 

2. Although melody and rhythm are not 
present, duration of item and speed of ad- 
ministration are controlled. 

3. The alto, tenor, and bass voices vary in 
the test book so that they have to be ap- 
praised against the one voice of the harmony 
which is constant, visually as well as aurally. 

4. The soprano is the only voice which re- 
mains stable; hence the student is forced to 
move down from the soprano through the 
alto, tenor, and bass voices in order to make 
his decision. 


Harmonic Idioms 


The harmonic idiom test combines three 
harmonic elements in a chord progression dis- 
playing all the controls which were estab- 
lished in the test of harmonic elements. Each 
of the chord progressions is a recognizable 
harmonic idiom. The student hears a sequence
of chords (Fig. 9) and sees three progressions
in the test book (Fig. 10). As before, he
indicates the correct answer by selecting
A, B, or C. The point of contact is
the complete first chord as well as the soprano 
and bass voices, which remain constant. The 
soprano and bass voices were added to the 
first chord as points of contact in order to 
avoid visual cues in the peripheral area of the 
outside voices by which the student could 
make a decision without examining the inner 
voices of the harmonic mass. 


Rhythmic Elements 


The third and last section of the Aliferis
test is devoted to rhythm. Again the test
inventor was faced with the problem of devising
test items which were both musically
significant and psychologically sound. The first
step was to establish a rhythmic figure of
one-beat duration as the smallest rhythmic entity
(see Fig. 11). However, two problems arose
in this development: first, one-beat rhythmic
figures have no significance when presented
singly; secondly, if the rhythmic figures were
stripped of all commonly coexistent musical
factors, melody and harmony, there would be
left only the tapping of rhythms, a procedure
in conflict with the declared objective of
creating musical-test items. As a result, melody
was incorporated into the rhythm test in such
a way that each single-beat rhythmic figure
was given three times in a melodic line which
was scalewise in the key of C. An example of
a rhythmic element item is shown in Fig. 12.
Melody, as applied, helped create a more
musical presentation and aided perception and
retention of the rhythmic figure without
distracting from the critical task. Although
pitch is incorporated in the auditory stimulus,
selection of answers is independent of the
sense of pitch. The student merely selects the
correct answer from four sets of rhythmic
figures (see Fig. 13) by marking A, B, C, or D.
Before playing each rhythmic figure the tempo
is established aloud by the administrator, and
the students are asked to make some kind of
motion in order to “feel” the beat. Each of
the distracters contains the same number of
notes as the right answer, and all rhythms are
given with the most common unit of notation,
the quarter note.

Fig. 11. One-beat rhythmic figures.

Fig. 12. Rhythmic element: auditory stimulus (Item 1, MM = 60, counted out loud).

Fig. 13. Rhythmic element: visual stimulus.


Rhythmic Idioms 


The rhythmic idiom test combines two 
rhythmic elements into a two-beat figure as 
shown in Fig. 14. It is presented in the same 
manner and with the same controls as the 
rhythmic elements (see Fig. 15). 


Test Administrations 


The 64 items comprising the Aliferis Music 
Achievement Test were selected through criti- 
cal study of results obtained from two pilot 
administrations of earlier forms of the test. 
In 1947 the first form (330 items) was ad- 
ministered to 419 entering freshman music 
students in ten representative schools through- 
out the United States. In 1949 a revised edi- 
tion of 138 items was administered to 376 en- 
tering freshmen in ten selected schools. In 
1950 a second revision was administered to 
1,963 freshmen students in 70 educational in- 
stitutions. 


Test Standardization 

In the fall of 1950 the final form of the 
test was offered to the 190 colleges which 
held full or associate membership in the
National Association of Schools of Music
(hereafter abbreviated NASM). Seventy of these
colleges, a representative sample of the
entire membership both by type of institution
and by geographic region, participated in the
national administration.

Fig. 14. Rhythmic idiom: auditory stimulus.

The participating schools were classified ac- 
cording to type of membership in the NASM. 
All schools except the junior colleges were 
also classified according to type of institu- 
tion. The four-year public and private uni- 
versities and colleges were grouped accord- 
ing to geographic region. 

Of the 70 participating colleges, 58 (83%)
were full members of the NASM. Of 7,444 
freshmen music majors entering all NASM 
member institutions, 1,963 (38%) took the 
test. 

Although the number of participating liberal
arts colleges (N = 32) was nearly equal
to the combined total of all other types of 
participating schools (N = 36), the state uni- 
versity student representation exceeded that 
of the liberal arts colleges both in the total 
number of entering freshmen music majors 
and in the number of freshmen music majors 
who took the test. The total number of par- 
ticipating students from four-year institutions 
was 1,936. 

Schools were classified geographically as 
follows: 


Western: The Rocky Mountain and Pacific
Coast states, including Montana, Wyoming,
Colorado, New Mexico, and states west.
Midwestern: Ohio, Kentucky, and all states
west to and including Kansas, Missouri, and
Nebraska; north to the Canadian border.
Southern: Virginia, Tennessee, and all states
west and south to and including Texas and
Oklahoma.
Eastern: States from Maine to Maryland
and Delaware, and west to but excluding Ohio.
National: This was not determined by the
sum of all tests, since some areas were
disproportionately represented; rather it was a
weighted sum to which each area contributed
in proportion to its total number of entering
freshmen music majors.

Fig. 15. Rhythmic idiom: visual stimulus.


Fifty-six per cent of the participating 
schools were located in the Midwestern re- 
gion, providing more than half (62%) of the 
student participation. Midwestern colleges 
accounted for only 47% of the total member- 
ship of the NASM, however. The student 
sample for the geographic study was only 
1,768 because students in junior colleges and 
urban conservatories were not included. 
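
Because the Midwestern region supplied a larger share of the sample than of the NASM membership, the national figure is described as a weighted sum rather than a pooled one. A minimal sketch of such a weighting follows. The regional total-score means are as read from Table 1; the enrollment counts are hypothetical, since the source does not report the number of entering freshmen music majors by region.

```python
def weighted_mean(means, weights):
    """Mean of regional means, each weighted by its population size."""
    total_weight = sum(weights.values())
    return sum(means[region] * weights[region] for region in means) / total_weight


# Total-test means by region, as read from Table 1.
regional_means = {
    "Eastern": 32.8,
    "Midwestern": 31.1,
    "Southern": 28.6,
    "Western": 35.2,
}

# Hypothetical counts of entering freshmen music majors per region.
entering_majors = {
    "Eastern": 1200,
    "Midwestern": 2400,
    "Southern": 1000,
    "Western": 600,
}

national = weighted_mean(regional_means, entering_majors)
```

With this scheme a region contributes according to how many entering majors it enrolls, not according to how many of its students happened to take the test.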


Test Analysis 


The final edition of the Aliferis test con- 
sists of 64 items, divided into three sections 
as follows: melodic, 26 items; harmonic, 18 
items; and rhythmic, 20 items. The Davis 
method of item analysis (2) was useful in se- 
lecting these 64 items from the 330 items on 
the first test. Item analysis of the test items 
used in the national administration showed 
difficulty indices ranging from 26 to 82. 
Fifty-four of the items had difficulty indices 
above 40. Difficulty levels between 40 and 
70 were found for 49 of the 64 items. Dis- 
crimination indices ranged from .07 to .69 
with only two indices less than .25. Thirty- 
eight of the test items possessed discrimina- 
tion indices of .40 or more. These figures 
indicate that the difficulty levels of the test 
items are quite satisfactory and that the test’s 
ability to discriminate between the good and 
the poor student in music, as measured by his 
total score on the test, is more than adequate. 
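
The two kinds of index discussed above can be illustrated with a simplified upper-lower comparison. This sketch is not the Davis method itself, whose indices are transformed values read from prepared charts; it assumes a plain percentage passing for difficulty and a difference in proportions between the top and bottom 27% of students (ranked by total score) for discrimination. The function name and the sample data are hypothetical.

```python
def item_stats(responses, item, fraction=0.27):
    """Return (difficulty, discrimination) for one item.

    responses: one list of 0/1 item scores per student.
    difficulty: percentage of all students answering the item correctly.
    discrimination: proportion correct in the upper group minus the
    proportion correct in the lower group.
    """
    n = len(responses)
    ranked = sorted(responses, key=sum, reverse=True)  # best total score first
    k = max(1, round(n * fraction))
    upper, lower = ranked[:k], ranked[-k:]
    difficulty = 100 * sum(row[item] for row in responses) / n
    p_upper = sum(row[item] for row in upper) / k
    p_lower = sum(row[item] for row in lower) / k
    return difficulty, p_upper - p_lower


# Hypothetical answers of ten students to three items.
responses = [
    [1, 1, 1], [1, 1, 0], [1, 1, 1], [1, 0, 1], [1, 0, 0],
    [0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 0], [0, 0, 0],
]
difficulty, discrimination = item_stats(responses, 0)
```

An item passed by nearly everyone or by nearly no one cannot separate strong from weak students, which is why the two indices are examined together.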


Test Intercorrelations 


Intercorrelations between part and total
scores of the Aliferis test were computed to 
determine to what extent each section of the 
test contributed toward the total measure- 
ment of auditory-visual discrimination. Us- 
ing test results obtained from the national 
administration, substantial correlation was 
found between the melodic and harmonic sec- 
tions and between each test section and the 
total test score. Correlation coefficients were:
$r_{mh} = .66$, $r_{mr} = .41$, $r_{mt} = .90$,
$r_{hr}$ = [illegible], $r_{ht} = .81$, and
$r_{rt} = .68$, where the subscripts m, h, r,
and t stand for melodic section, harmonic
section, rhythmic section, and total
test, respectively. The high correlation of
the part scores with the total scores indicates 
that each section is contributing toward the 
measurement of a common musical ability. 
Evidence to date suggests that this ability is 
auditory-visual discrimination. 
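
For readers who wish to reproduce coefficients of this kind, a minimal product-moment correlation is sketched below. The section scores are hypothetical and serve only to show the computation. Note that each part score is contained in the total, so a part-total coefficient is necessarily raised by that overlap.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical section scores for six students.
melodic = [20, 15, 12, 18, 9, 14]
harmonic = [12, 8, 7, 10, 5, 9]
rhythmic = [15, 12, 13, 11, 10, 14]
total = [m + h + r for m, h, r in zip(melodic, harmonic, rhythmic)]

r_mh = pearson(melodic, harmonic)   # section with section
r_mt = pearson(melodic, total)      # section with total (includes overlap)
```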


Reliability 

Estimates of test reliability were computed 
using Hoyt’s method (3). The criterion 
group was a random sample of 100 from the 
1,768 students representing all four-year pub- 
lic and private colleges and universities but 
excluding urban conservatories. The estimated
reliabilities for part scores and total
score are: melodic, .84; harmonic, .72;
rhythmic, .67; and total, .88.
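
Hoyt's method estimates reliability from an analysis of variance of the student-by-item score table: the coefficient is the mean square among students less the residual mean square, divided by the mean square among students. The sketch below assumes items scored 0 or 1 and uses a small hypothetical table; it is an illustration of the computation, not a reconstruction of the analysis reported here.

```python
def hoyt_reliability(scores):
    """Analysis-of-variance reliability for a students x items table of scores."""
    n = len(scores)        # students
    k = len(scores[0])     # items
    grand = sum(sum(row) for row in scores) / (n * k)
    student_means = [sum(row) / k for row in scores]
    item_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_students = k * sum((m - grand) ** 2 for m in student_means)
    ss_items = n * sum((m - grand) ** 2 for m in item_means)
    ss_residual = ss_total - ss_students - ss_items

    ms_students = ss_students / (n - 1)
    ms_residual = ss_residual / ((n - 1) * (k - 1))
    return (ms_students - ms_residual) / ms_students


# Hypothetical scores of five students on four items.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
reliability = hoyt_reliability(scores)
```

For this table the students' mean square is .625 and the residual mean square is .125, giving a coefficient of .80.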


Analysis of Scores 
By Geographic Region 


Mean scores and standard deviations for all 
four-year colleges and universities are tabu- 
lated by geographic regions in Table 1 for 
the three sections of the test and the total 
test. Results of the test of homogeneity of 
variances and analysis of variance are also 
shown. Schools in the Western area obtained
the highest mean score on each section
as well as on the total test. However,
the variability is also greatest for schools in 
the Western area on all but the rhythmic sec- 
tion. Lowest regional mean scores for all sec- 
tions of the test and the total test are those 
of the Southern area. The smallest variance 
was obtained by the Southern area on the 
melodic section, by the Midwestern area on 
the harmonic section and the total test, and 
by the Eastern and Western areas on the 
rhythmic section. 


By Type of Institutions 


Mean scores and standard deviations, com- 
puted according to type of institution, for the 
three sections and the total test are shown in 
Table 2. F and L ratios are also indicated, 
showing the results of the analysis of variance 
and the test for homogeneity of variance, re- 
spectively. 
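
Both ratios can be approximated from summary figures alone. The sketch below uses the melodic-section means, standard deviations, and numbers of tests as read from Table 2 (an assumption wherever the print is unclear). The text does not define the L ratio; the sketch assumes it is the Neyman-Pearson L1 criterion, the weighted geometric mean of the sample variances divided by their weighted arithmetic mean, which equals 1 when all variances are identical.

```python
from math import exp, log

# (mean, standard deviation, number of tests), melodic section, read from Table 2.
groups = {
    "Private universities": (13.9, 5.6, 239),
    "State universities": (12.5, 5.4, 721),
    "Liberal arts colleges": (12.2, 5.2, 573),
    "Teachers colleges": (11.3, 5.3, 235),
    "Urban conservatories": (13.3, 5.5, 168),
}


def l1_ratio(groups):
    """Assumed L1 criterion: geometric over arithmetic mean of the variances."""
    n_total = sum(n for _, _, n in groups.values())
    arithmetic = sum(n * sd ** 2 for _, sd, n in groups.values()) / n_total
    log_geometric = sum(n * log(sd ** 2) for _, sd, n in groups.values()) / n_total
    return exp(log_geometric) / arithmetic


def f_ratio(groups):
    """One-way analysis of variance computed from group means and SDs."""
    k = len(groups)
    n_total = sum(n for _, _, n in groups.values())
    grand = sum(m * n for m, _, n in groups.values()) / n_total
    ms_between = sum(n * (m - grand) ** 2 for m, _, n in groups.values()) / (k - 1)
    ms_within = sum((n - 1) * sd ** 2 for _, sd, n in groups.values()) / (n_total - k)
    return ms_between / ms_within


L = l1_ratio(groups)
F = f_ratio(groups)
```

Because the inputs are rounded to one decimal place, the results can only approximate the tabled values; under the stated assumption they fall close to the .999 and 8.406 reported for the melodic section.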

Highest mean scores on all sections, as well 
as the total test, were established by the pri- 
vate universities. Lowest mean scores were 
made by the teachers colleges, except on the 
rhythmic section where liberal arts colleges 
made the lowest mean score. Private uni- 
versities showed the highest standard devia- 
tion on the melodic section and nearly the 
highest on the harmonic section. Urban con- 
servatories had high variability on the me- 
lodic and rhythmic sections and highest on 
the harmonic section. Urban conservatories
and private universities had the highest
variance on the total test.


Table 1

Mean Scores and Standard Deviations, with Significance of Differences, of All Four-Year Colleges, According to
Geographic Regions, Aliferis Music Achievement Test, College Entrance Level

Geographic     Melodic          Harmonic         Rhythmic         Total
Region         Mean     SD      Mean     SD      Mean     SD      Mean     SD

Eastern        13.0     5.2     7.4      3.8     12.4     ...     32.8     10.1
Midwestern     12.6     5.4     7.0      3.2     11.6     ...     31.1     9.3
Southern       11.1     5.0     6.7      3.4     10.9     ...     28.6     10.0
Western        14.5     6.0     8.2      4.2     12.6     ...     35.2     11
National†      12.4     5.4     7.1      3.5     11.6     ...     31.1     10.3

L ratios‡      .996             .991             .997             .997
F ratios§      14.587**         6.120**          12.216**         15.760**

[Entries shown as ... are illegible in the source; the Western total SD is legible only as 11.]

† This is a weighted sum to which each area contributes in proportion to its total number of entering freshman music majors.
‡ The L ratio tests the hypothesis that all the samples come from populations having the same variance. The hypothesis is accepted in all cases.
§ The F ratio tests the hypothesis that there is no difference between the mean scores of the schools in the different geographic regions. The double asterisk (**) indicates that this hypothesis is rejected at the 1% level of confidence.


Table 2

Mean Scores and Standard Deviations, with Significance of Differences, of All Four-Year Colleges, According to
Type of Institution, Aliferis Music Achievement Test, College Entrance Level

                          Melodic          Harmonic    Rhythmic    Total       No. of
Type of Institution       Mean     SD      Mean        Mean        Mean        Tests

Private Universities      13.9     5.6     8.0         12.7        ...         239
State Universities        12.5     5.4     7.0         11.9        31.3        721
Liberal Arts Colleges     12.2     5.2     6.9         11.0        30.1        573
Teachers Colleges         11.3     5.3     6.8         11.4        29.4        235
Urban Conservatories      13.3     5.5     7.5         11.9        32.6        168
All Types (Sum)           12.5     5.4     7.1         11.6        31.2        1936

L ratios†                 .999             .989        .995        .992
F ratios‡                 8.406**          5.874**     11.657**    11.241**

[The standard deviations for the harmonic, rhythmic, and total scores, and the entry shown as ..., are illegible in the source.]

† The L ratio tests the hypothesis that all the samples come from populations having the same variance. The hypothesis is accepted in all cases.
‡ The F ratio tests the hypothesis that there is no difference between the mean scores of the schools in the different geographic regions. The double asterisk indicates that this hypothesis is rejected at the 1% level of confidence.


Validation 


The relationships of test scores with fresh- 
man and sophomore grades—music, academic, 
and total grades—were studied to determine 
to what extent the scores obtained from tests 
administered at the beginning of the freshman 
year agreed with grades given at the end of 
the freshman year and at the end of the 
sophomore year. Close agreement of test 
scores with grades would suggest that the test 
was measuring, to some extent, a form of 
musical activity which contributes to the final 
grade ratings, i.e., that there were aspects of 
achievement common to the testing and to 
the grading. Furthermore, if the Aliferis 
test measures only music achievement, the 
test scores should be closely related to the 
grades given to represent achievement in 
music courses and should be comparatively 
unrelated to the grades given for achievement 
in general academic courses (which might be 
considered an indication of general intelli- 
gence). 

To determine the extent of the relationship 
between test scores and grades, correlation 
coefficients were computed for each of the 
test section scores and the total score with 
honor-point ratios in music courses and in 
general academic courses. Correlations
between the test scores and freshman
honor-point ratios are shown in Table 3;
correlations between test scores and the two-year
honor-point ratios are shown in Table 4. 
The one-year and two-year student samples 
overlap; therefore these are not independent 
estimates of validity. 

The substantial correlation found between 
the Aliferis test scores and honor-point ratios 
in music courses, both for the freshman year 
(.61) and for the freshman and sophomore 
years combined (.53), indicates that the
Aliferis test does not merely measure general
intelligence but measures, to a considerable
extent, an aspect of musicianship evaluated by
teachers in assigning grades in music courses. 


Table 3

Correlations Between Aliferis Test Scores and
Freshman Honor-Point Ratios
(N = 177)

                    Freshman Honor-Point Ratios
Aliferis            Music        Academic     All
Test Scores         Courses      Courses      Courses
                    Only         Only

Melodic             .54          .22          .40
Harmonic            .41          .21          .33
Rhythmic            .46          .15          .34
Total score         .61          .25          ...

[The entry shown as ... is illegible in the source.]

Note. Data compiled from official transcripts of students
enrolled in four Midwestern state universities.


Table 4

Correlations Between Aliferis Test Scores and
Two-year Honor-Point Ratios
(N = 123)

                    Two-year Honor-Point Ratios
Aliferis            Music        Academic     All
Test Scores         Courses      Courses      Courses
                    Only         Only

Melodic             .57          .26          .39
Harmonic            .53          .08          .38
Rhythmic            .25          ...          ...
Total score         ...          .28          .40

[Entries shown as ... are illegible in the source.]

Note. Data compiled from official transcripts of students
enrolled in four Midwestern state universities. Only 123 of the
177 students in Table 3 completed two years of study as music
majors.


Summary and Conclusions 


Psychological concepts and problems in- 
volved in the development of a test of music 
achievement at the college entrance level are 
discussed in this paper. Test items are described
and test uses are suggested. Discussion of the
experimental background and analysis of the
Aliferis Music Achievement Test includes
descriptions of populations involved in the
pilot and national administrations,
classification of the national population
by type of institution and geographic region, 
and results of item-analysis, reliability, and 
validity studies. Results of these studies 
seem to justify the conclusion that the Ali- 
feris test fills a great need for a measure of 
music achievement at the college entrance 
level. It fills this need by testing the stu- 
dent’s power of auditory-visual discrimina- 
tion, using recognizable musical materials 
which incorporate adequate psychological
controls.


The Effect of Stroke Width on Linear Interpolation


In recent years the problem of scale reading
and setting has come to be recognized as
important for instrument design, especially
when interpolation of various positions is
required.

The present investigation is concerned with 
the effect of stroke widths on linear
interpolation. The only previous work dealing
with this specific problem was performed by
Backström (1), who studied interpolated
readings between two markers. The end markers
had a 2 mm. distance between their centers.
A third marker was used to indicate various
positions between the end markers. All three
markers had the same stroke width, which was
approximately .10, .24, or .35 of the interval
size. Backström concluded that the optimum
stroke width was one-tenth the size of the
scale interval, and that the thicker stroke
widths tended to increase the error of reading.

More recent investigators have been inter- 
ested in various other problems relating to 
linear interpolation. In 1949 Miller (3) 
studied the effects of interval sizes on inter- 
polation. He found no optimum interval size, 
since good subjects tended to be good and 
poor subjects poor no matter what interval 
size was used. Miller’s study was confined 
to interpolated readings. Levett (4) ex- 
plored both readings and settings on a linear 
scale and concluded that reading “Miller 
Cards” is not the same as making interpo- 
lated settings. Schubert (5) investigated the 
effects of training on the reduction of subjec- 
tive biases in making interpolated settings. 
Although training corrected biases tempo- 
rarily, the beneficial effects were dissipated 
after a period of time. 

The studies by Miller, Levett, and Schu- 
bert all used hairlines for the scale markers. 
Backström used various stroke widths, but
none that were of value in aiding an individual
to interpolate. It was the purpose of this
study to employ stroke widths which could be
used as reference points to indicate particular
positions along the scale. With a 10-mm.
interval between the end markers, stroke
widths of exactly 1 mm., 2 mm., 3 mm., 4 mm.,
and 5 mm. mark off exact tenths of the
interval. Since it has been shown that greater
discrepancies in settings occur in certain
positions of the scale (3, 4), it is possible
that there may be an optimal stroke width
which will reduce the mean discrepancies to a
minimum.


Procedure 


The apparatus was essentially the same as that 
used by Levett (4) and Schubert (5) with the ex- 
ception of revision of the scale. Six different scales 
were used. One scale consisted of two vertical hair- 
lines 10 mm. apart on the stationary portion of a 
slide-rule arrangement, and a third vertical hairline 
on the movable portion. The other five scales had 
markers of 1 mm., 2 mm., 3 mm., 4 mm., and 5 mm. 
in width, and a 10-mm. distance between the mid- 
points of the stationary markers. This enabled a 
subject, using vernier acuity, to mark off two of the 
nine discrete settings for a particular scale by lining 
up one edge of the movable marker with the inner 
edge of one of the fixed markers. In this way the 
1-mm. stroke width could mark off Positions 1 and
9, the 2-mm. stroke width Positions 2
and 8, etc. All the scales were on the same slide
rule but a window-flap arrangement was provided 
so that only one scale was visible to the subject at 
a time (see Fig. 1). The series of scales was cov- 
ered with glass to facilitate cleaning. 
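
The geometry of the edge alignment described above can be checked with a short calculation. The sketch assumes that the movable marker has the same width as the fixed markers on a given scale, which is consistent with the positions the text reports; the function name is introduced here for illustration.

```python
def aligned_positions(stroke_mm, interval_mm=10.0):
    """Positions (in tenths of the interval) reachable by edge alignment.

    The inner edge of a fixed marker lies half a stroke width inside its
    mid-point, and the mid-point of the movable marker lies another half
    stroke width beyond the edge that is lined up with it.
    """
    offset_mm = stroke_mm / 2 + stroke_mm / 2
    low = 10 * offset_mm / interval_mm
    return low, 10 - low


for width in (1, 2, 3, 4, 5):
    print(width, aligned_positions(width))
```

On the 5-mm. scale the two alignments coincide at the mid-point of the interval, so that scale yields a single reference position instead of two.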

The slide rule was attached to a light lever which 
projected the setting of the variable marker on an 
enlarged scale, marked off in one-hundred units and 
visible only to the experimenter. Settings were made 
by turning a knob attached to a drive screw which 
fitted into a threaded block attached to the slide. 
Lighting was provided by a gooseneck tungsten 
75-w. lamp which could be adjusted to eliminate 
glare. 

Twenty Ss, 18 men and two women, were used 
for this experiment. All the men were students at 
Lehigh University. Four were business majors, 
eleven were psychology majors, two were graduate 
psychology students, and one was a mechanical en- 
gineer. One of the women was a secretary while the 
other was a business-machine operator. 

All Ss were instructed that the general task was to set one-tenth, two-tenths, three-tenths, etc. of the
interval as specified by the experimenter. They were permitted to work at their own speed and to use any
method of estimation. Forty-five practice settings were made by each subject on the hairline scale
before starting the main trials.

A trial consisted of 54 settings, that is, setting the nine positions on each of the six scales. For any
one scale the subject was required to move from setting to setting, designated in irregular order by
the experimenter. With the exception of always presenting the hairline scale first, the experimenter
varied the order of scales to counterbalance order effects.

A series of eight uninformed trials followed by eight informed trials was run on each subject. When
uninformed, the S was not told the stroke width, while on the informed trials he was told the stroke
widths and also shown how to use the edge alignment to get Positions 1 and 9, 2 and 8, etc. The Ss
were given no knowledge of results in either the informed or uninformed trials. Each S made a total
of 864 settings in the 16 trials, making a grand total of 17,280 settings for all 20 subjects.

Fig. 1. A model of the slide-rule face (actual size indicated).


Results


The mean discrepancies in settings for the
individual Ss are summarized in Tables 1 and
2. Table 1 shows the mean discrepancy in
settings for each of the six scales for all 20
Ss on the uninformed trials. An asterisk
indicates that a mean discrepancy differs
significantly from that of the hairline scale. All
means are given in hundredths of the interval.
It will be noted that on the 2-mm., 3-mm.,
4-mm., and 5-mm. scales less than half
of the subjects had significantly fewer errors,
while on the 1-mm. stroke width none of the
subjects made significantly fewer errors than
on the hairline scale. When the group is
taken as a whole and the differences between
the grand means are considered, the reductions
in errors are not significant for any of
the thicker strokes.

Table 1. Uninformed Trials: Mean Discrepancies in Settings
[Entries for the 20 individual Ss and the grand means are not recoverable from the source.]
* Significant at 5% level of confidence.

Table 2. Informed Trials: Mean Discrepancies in Settings
[Entries for the 20 individual Ss are not recoverable from the source.]

Stroke Width     Hair.    1 mm.    2 mm.    3 mm.    4 mm.    5 mm.
Grand Mean       2.29     1.75     1.27*    1.03*    1.04*    ...

* Significant at 5% level of confidence.

Table 2 presents the mean discrepancies in 
settings for the informed trials. For the 3- 
mm. scale and also for the 4-mm. scale all 
20 Ss, taken individually, made significantly 
fewer errors than on the hairline scale, indi- 
cating a superiority of these two stroke 
widths. On the other stroke widths more 
than half, but not all, of the subjects made 
significantly fewer errors. For all 20 Ss, 
taken as a group, the 2-mm., 3-mm., and 4- 
mm. stroke widths all have significantly 
lower grand mean errors than the hairline, 
with the reduction in error for the 3-mm. 
stroke width reaching the 1% level of con- 
fidence. 

Table 3 presents the mean discrepancies in 
settings on informed trials for the various 
stroke widths on all of the nine interpolated 
positions. These means are based on the dis- 
crepancies in settings on each position for 
all 20 Ss. The blocked-in numbers represent
errors which may be attributed purely to
vernier acuity. However, the reductions in
discrepancy are not confined to these vernier
acuity positions, but occur all along the scale,
with the exception of Position 7 on the 1-mm.
stroke width.

Mean discrepancies in settings may be 
ascribed both to variations in settings and 
also to subjective biases. In order to deter- 
mine which plays the more important role in 
determining the mean discrepancies, an ex- 
amination of the individual data was made. 
In general there was a greater reduction in 
the algebraic errors (biases) than in the
standard deviations (variation) for the
different scale positions using the thicker stroke
widths. Here again, the reduction in biases 
not only occurred for the positions that may 
be attributed to vernier acuity, but also for 
the other positions along the scale. 
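
The distinction drawn here between bias and variation can be made concrete with a small computation. The sketch assumes that a "mean discrepancy" is the mean absolute difference between setting and required position, that the algebraic error is the mean signed difference, and that variation is the standard deviation of the settings; the settings themselves are hypothetical and expressed in hundredths of the interval.

```python
from statistics import mean, pstdev


def summarize(settings, target):
    """Separate overall error into bias and variability for one scale position."""
    errors = [s - target for s in settings]
    return {
        "mean_discrepancy": mean(abs(e) for e in errors),  # size of error, sign ignored
        "bias": mean(errors),                              # algebraic (signed) error
        "sd": pstdev(settings),                            # spread of the settings
    }


# Two hypothetical subjects setting Position 3 (30 hundredths of the interval).
biased = summarize([31, 33, 32, 34], target=30)      # consistently too high
unbiased = summarize([28, 32, 29, 31], target=30)    # scattered about the target
```

The first subject's discrepancy is entirely bias, whereas the second subject has no bias and owes the whole discrepancy to variation, which is the kind of separation the individual data were examined for.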


Discussion 


The results on the hairline scale corroborate 
the findings of other investigators (2, 4), 
namely, that individuals differ in their sub- 
jective biases and remain reasonably consist- 
ent in their judgment. On the hairline scale 
Positions 1, 5, and 9 have relatively fewer 
errors than the other positions, indicating 
that the end and the center of the scale are 
more easily judged. This finding is also sup- 
ported by Levett since he found fewer errors 
in these same positions. 

The results do not, however, agree with
those of Backström on the influence of stroke
width. An adequate comparison with this
study is difficult to make for three reasons:
First, the present investigation used settings
on a slide-rule arrangement, while Backström
used readings from cards. Second, the present
study required settings of even tenths of
a 10-mm. interval, whereas Backström
required reading to the nearest tenth of a
2-mm. interval when various hundredths of the
interval were indicated. (This is an unreal
situation since if hundredths of the scale
interval are marked, the S should be permitted
to interpolate to hundredths, not tenths.)
Third, it is very unlikely that Backström’s
stroke widths of .24 and .35 of the interval
could be used as reference indicators.
Although his subjects were uninformed, the
stroke width of .10 of the interval size had
the possibility of being used as a reference
indicator. This may partially explain his
conclusion that this stroke width is optimum.
In the present study it was found that the
3-mm. and 4-mm. widths were superior, and
that the 1-mm. width, Backström’s optimum,
was the poorest.

Table 3. Mean Discrepancy in Settings According to Scale Position

Perhaps the most interesting result is that 
with thicker strokes, especially 3 mm. and 
4 mm., reductions in errors occur on all the 
scale positions and not just on those positions 
which may be ascribed to correction by ver- 
nier acuity. 

Summary 


Twenty subjects were required to make in- 
terpolated settings on six different scales. All 
the scales had a 10-mm. distance between the 
mid-points of their end markers, but differed 
in stroke width (hairline, 1 mm., 2 mm., 3 
mm., 4 mm., and 5 mm.). With the five 
thicker strokes it was possible to align the 
markers so as to set two of the nine inter- 
polated positions on the basis of vernier 
acuity. 

Even when uninformed of the stroke width, 
about half the Ss made significantly fewer 
errors on the thicker stroke widths, although 
the differences between grand mean errors for
all 20 Ss, taken as a group, are not signifi- 
cant. When informed of the exact width of 
the markers, all 20 Ss, taken individually, 
made significantly fewer errors with the 3- 
mm. and 4-mm. widths than with the hair- 
line. However, the 2-mm., 3-mm., and 4- 
mm. stroke widths all had significantly fewer 
errors than the hairline when all 20 Ss were 
taken as a group. The reduction in error not 
only occurs on those positions along the scale 
which may be set by vernier acuity, but on 
all other positions as well. 


Received June 14, 1954. 
Television Station *

With the beginning of operations of the nation’s
first noncommercial educational television
station, KUHT-TV (Very High Frequency),
at the University of Houston in
June 1953, it became possible to examine
some of the possibilities inherent in educational
television. Two main problems have
been dealt with so far: (a) effectiveness of
television as a medium of formal course instruction;
(b) responses of the Houston-wide
TV audience to programming on the new educational
TV station.

With respect to the first of these problems, 
effectiveness of television instruction, Strom- 
berg (8) suggested that students taking a 
psychology course via television did slightly 
better on a final examination than those tak- 
ing the course in traditional classroom fashion. 
However, in this study the necessity of utiliz- 
ing the same instructor or instructors in both 
the TV and non-TV-instructed groups could 
not be met, and the statistical significance of 
differences was not determined. Husband 
(5), controlling the like-instruction factor, 
also suggested higher achievement in tele- 
vision-instructed groups, but again the sta- 
tistical significance of differences was not 
determined. Allen (1), using the identical 
instructor in both a TV-instructed and non- 
TV-instructed group with respect to achieve- 
ment in an ROTC short course at the Uni- 
versity of Houston, found no statistically sig- 
nificant differences in achievement. The pres- 
ent paper deals in part with further examina- 
tion of the problem of the effectiveness of TV 
instruction in both an elementary psychology 
course and a biology course. 

1 This paper was presented at the American Psychological
Association, New York, 1954.

2 Biology professor and psychology graduate student,
respectively, who, under supervision, contributed
some of the data used by the senior author
in preparing the present paper.

With respect to the second problem, responses
of a city-wide audience to educational
television programming, Otis (7) in a study 
in Cleveland (not involving, however, an 
educational TV station but, rather, educa- 
tional programs presented over a commercial 
station) found a preference pattern for col- 
lege courses on television which suggested 
generally that liberal arts courses were pre- 
ferred to science courses. The present paper, 
however, deals not only with a study of 
courses preferred by TV viewers, but also 
with a determination of the size of the 
KUHT-TV educational TV audience, the 
ranking of the favorite programs already on 
the station, and suggestions by the educa- 
tional-station viewers for future programs. 


Subjects 


To investigate achievement differences in an ele- 
mentary psychology course, 96 Ss enrolled in a tra- 
ditional campus lecture section, 17 Ss enrolled in a 
TV-lecture only, correspondence section, and 30 Ss 
enrolled in a TV-lecture-plus-campus-discussion sec- 
tion were used. 

To investigate achievement differences in an ele- 
mentary biology course, two groups of 78 Ss matched 
for college class, grades, and sex in both TV and 
non-TV sections were used. To determine the size 
and reactions of the Houston-wide audience of the 
University of Houston educational TV station, 384 
Ss randomly selected from the Greater-Houston tele- 
phone directory were interviewed. 


Procedure 


For the elementary psychology course study, final 
examination grades on a standardized final examina- 
tion of 150 items were compared for (a) the tra- 
ditional campus lecture section (non-TV), (b) a 
group that fulfilled the requirements for the course 
by simply watching the lectures on television and 
completing problems presented in a special manual 
prepared for the course by Evans (2), and (c) a 
group that watched the lectures on television and 
met on the campus for two discussion periods a 
week. In each instance the instruction was handled 
by the same professor. The traditional campus lec- 
ture or non-TV group was not “contaminated” by 
the television lectures, since this was a group that 
had taken the course before the television series
began. It was impossible, however, to “match” the 
three groups with respect to certain characteristics 
such as age or intelligence which might function as 
sampling biases, since no control over the composi- 
tion of these groups could be introduced and the n’s 
proved too small for any artificial matching. 

For the study of the elementary biology course, 
midsemester grades on a 70-item objective-type ex- 
amination were compared for a group that took the 
course entirely in the traditional campus lecture sec- 
tion, and a group that watched the lectures on tele- 
vision and met on the campus for two discussion 
periods per week. In both instances the instruction
was given by the same professor. Size of the two 
sections was large enough to allow the analysis of 
the achievement of two groups matched for age, 
grades, and sex. 

For the study concerning the size and reactions of 
the educational television audience, as part of a more 
elaborate survey conducted by McAdams (6), a 
standardized questionnaire was administered by tele- 
phone to the 384 Ss. The questionnaire was care- 
fully developed on the basis of two pilot surveys in- 
volving 100 Ss each, so that it contained questions 
with a minimum of ambiguity, and was of a length 
that would not discourage a telephone respondent 
from giving complete answers. All of the telephone 
interviews were conducted by two carefully trained 
individuals, so that a minimum amount of confusion 
would be involved in interpreting responses to the 
open-end questions included in the interview form. 
The questions were designed to determine on a con- 
tinuum how often the educational station was being 
watched. Three programs weekly or more was de- 
fined as “very often,” at least two programs a week 
was defined as “often,” at least one every two weeks 
was defined as “seldom.” A pilot questionnaire had 
been administered to 100 Ss. It included requests 
for the respondent to indicate programs he would 
like to see scheduled on the educational TV station 
and why. For the final questionnaire these choices 
were grouped as children’s programs, panel discus- 
sions, individual lectures, formal course offerings, 
classical music, sports, and educational films. The 
respondent was merely asked to indicate “yes” or 
“no” and “why” with respect to whether or not he 
wanted each of these program groupings to be in- 
cluded in future scheduling. 
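
The viewing-frequency definitions above can be written out as a small classifier. This is only a sketch of the stated thresholds: the function name, the choice of "programs per two weeks" as the unit, and the "never" label for rates below the lowest threshold are assumptions made for illustration and do not come from the questionnaire itself.

```python
def viewing_category(programs_per_two_weeks: float) -> str:
    """Map a viewing rate to the survey's frequency labels.

    The rate is expressed per two-week period so that all thresholds
    quoted in the text become whole numbers:
      "very often": three programs weekly or more  -> 6 or more per two weeks
      "often":      at least two programs a week   -> 4 or more per two weeks
      "seldom":     at least one every two weeks   -> 1 or more per two weeks
    The "never" label for lower rates is an assumption, not from the text.
    """
    if programs_per_two_weeks >= 6:
        return "very often"
    if programs_per_two_weeks >= 4:
        return "often"
    if programs_per_two_weeks >= 1:
        return "seldom"
    return "never"


if __name__ == "__main__":
    for rate in (8, 4, 2, 0):
        print(rate, viewing_category(rate))
```

Checking the categories from most to least frequent means each respondent receives the highest label for which he qualifies.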


Results and Discussion 


Mean final examination scores of 96.11,
non-TV; 98.00, TV; and 99.04 in the TV-plus-discussion
sections of the elementary
psychology course were obtained. Fisher t
tests, which were computed to determine the
significance of the differences among the
three groups, suggested that the differences
were not statistically significant. Likewise,
the difference between the mean scores of
50.17 for the elementary biology non-TV section
and 45.82 for the elementary biology TV
section was not statistically significant.
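
The report does not give the computations behind the t tests. As an illustration only, the sketch below computes a pooled-variance two-sample t statistic, on the assumption that this standard form is what was used; the function name and the example scores are hypothetical and are not data from the study.

```python
import math


def pooled_t(sample_a, sample_b):
    """Return (t, degrees_of_freedom) for two independent samples.

    Uses the pooled-variance form: both groups are assumed to share a
    common population variance, estimated from the combined sums of
    squared deviations.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / n_a
    mean_b = sum(sample_b) / n_b
    ss_a = sum((x - mean_a) ** 2 for x in sample_a)
    ss_b = sum((x - mean_b) ** 2 for x in sample_b)
    df = n_a + n_b - 2
    pooled_variance = (ss_a + ss_b) / df
    standard_error = math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error, df


if __name__ == "__main__":
    # Hypothetical examination scores for two small sections.
    non_tv = [92, 101, 97, 88, 103]
    tv = [95, 104, 99, 93, 100]
    t, df = pooled_t(non_tv, tv)
    print(f"t = {t:.2f} with {df} degrees of freedom")
```

The obtained t would then be compared with the tabled critical value for the same degrees of freedom. For the matched biology groups a paired-difference form might be more appropriate; the report does not say which was applied.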

The viewers’ responses to the audience-reaction
questionnaire revealed that
75% of the population possessing TV sets in 
the sample surveyed watched the educational 
TV station at least once every two weeks. 
Of this percentage (75%), 18% watched 
two programs per week or more on the station. 

It was further revealed that 50% of the 
station’s viewers had no favorite program, but 
14% most frequently watched a basketball 
sports program, 12% an elementary psychol- 
ogy course, 7% a forum discussion of inter- 
national events, 5% a Spanish course, 3% a 
home nursing series, 2% an English literature 
course, 2% the elementary biology course, 2% 
an art presentation, 1% a charades-like pro- 
gram called “Chalk Talk,” and among the re- 
maining 2% were distributed all the remain- 
ing programs on the KUHT-TV schedule. 

When the total of 384 respondents were 
asked to select from seven categories what 
programs they would like to see scheduled on 
the educational station, 83% selected sports 
programs, 73% chose educational films, 68% 
favored panel discussions, 60% desired chil- 
dren’s programs, 50% individual lectures, 
44% formal course offerings, and only 39% 
classical music. 

The reasons given for these different selections
were extremely heterogeneous. However,
some of the significant conclusions that
may be drawn from both the quantitative
and qualitative data obtained in the survey
suggest that (a) KUHT-TV, the nation’s
first educational television station, is being
watched quite regularly by a surprisingly
large portion of the Houston TV audience,
(b) several KUHT-TV programs of the past
or present have developed small but extremely
loyal audiences, (c) efforts in future educational
television programming should be
geared to reach the family as a whole, (d) a
very fertile field of interest among viewers
exists in presenting TV programs for the
child, and (e) some of the more formal educational
TV programs could profit by the
use of devices such as educational films or by
the use of more dynamic and interesting participants.
(This last finding is supported by
a survey conducted by Evans [4] before the
educational television station began operations. It
was found that 93% of a random sampling
of Houston TV-set owners felt that educational
television should be entertaining.)

In general, these findings suggest that (a)
the efficiency of the classroom teacher is apparently
not hampered by presenting lectures
on television, and (b) a discriminating audience,
small when compared to commercial
standards, but surprisingly large in light of
the recency of the development of educational
television, can be expected even a short time
after the beginning of operations of a VHF
educational television station in a large community,
if the response of the Greater-Houston
television audience is any indication.

Further research, however, is called for in 
many areas. For example, what effect does 
educational television programming have on 
the attitudes of viewers? In a preliminary
study, Evans (3) found that 60% of the 
viewers report an ability to “understand peo- 
ple better” as a result of watching an ele- 
mentary psychology course on television. A 
more qualitative analysis of this type of data 
is called for. Another problem that is sug- 
gested here deals with teaching techniques. 
Evans (3) found that a lecture involving use 
of a blackboard as a teaching technique over 
more elaborate visual aids was preferred by 
students in a telecourse. This preference 
might imply that the personality of the in- 
structor is an important factor in pleasing a 
television student, which, in turn, might sug- 
gest the need for further research to (a) dis- 
cover whether “preferred” television pres- 
entation techniques are related to learning, 
and (b) what types of personality traits are 
most desirable in a television instructor. 

At this early stage of development, how- 
ever, when educational television audiences 
are still small and the impact of educational 
television is so recent, it is the author’s opin- 
ion that precise large-scale evaluation is im- 
possible. We must be content with restricted 
research attempts that may suggest crucial
hypotheses for research in educational tele- 
vision when large-scale evaluations become 
possible. 


Summary 


The achievement of students enrolled in 
elementary psychology and biology television 
instruction and nontelevision instruction sec- 
tions was compared. No significant differ- 
ences were found. 

An audience-reaction survey of Houston 
televiewers revealed that 75% of the audi- 
ence watched the station from as often as 
three times a week to at least once in two 
weeks. Highly preferred programs included 
sports, a psychology course, and an international
affairs discussion panel. Suggested programs
ranked most highly were sports programs,
educational films, panel discussions, and
children’s programs. Least preferred programs
were of the classical music type.


Received August 17, 1954. 
Previous researchers (1) have suggested
that in a group-performance test situation, 
scoring the intangible products of perform- 
ance resulted in a slightly lower interobserver 
reliability than scoring the tangible products 
of performance. The present study com- 
pares interobserver consistency when the tan- 
gible and intangible products of performance 
on individually administered performance tests 
are scored. Since measurements of the final 
product (tangible measurements) can be made 
at the examiner’s leisure, and since gauges 
and other measuring instruments can be em- 
ployed as aids in making these estimates, 
some performance-test constructors maintain 
that measurements of the final product should 
yield greater interexaminer consistency than 
measurements of performance in process (in- 
tangible products). An example of a meas- 
urement of performance in process (in- 
tangible product) is scoring the examinee’s 
technique in doing a job, or scoring his ad- 
herence to the prescribed safety precautions. 
An example of a tangible product is the ad- 
herence of the final product to prescribed di- 
mensions or standards. Since measurements 
of performance in process frequently tell 
where and how examinees erred rather than 
merely what mistakes were made, measure- 
ments of intangible products should be in- 
cluded in any performance test, provided 
there is no simultaneous loss in inter- (and 
intra-) observer consistency. 

The present research was directed toward
investigating the conjecture that for individually
administered performance tests, final
product measurement yields no greater interobserver
consistency than observations of performance
in process.

1 The data herein reported were collected under
Contract N-onr-872(00) between the Institute for
Research in Human Relations and the Office of
Naval Research. The opinions expressed are those
of the author and do not necessarily represent those
of the Office of Naval Research or of the naval
service.

2 The following persons made substantial contributions
to the research here reported: Drs. G. Douglas
Mayo, F. Kenneth Berrien, F. Loyal Greer, and Douglas
Courtney, Mr. J. Harry Hill, and Miss Mimi
Taylor.


Method 


Eight performance tests, scored by the check-list 
method, were constructed. The construction of 
these tests has been previously described (2). The 
check lists generally included items in the following 
four areas: (a) observations of the procedure fol- 
lowed in doing the job, (b) observations of the ex- 
aminee’s adherence to safety precautions, (c) ob- 
servations of the examinee’s methods of using tools 
and equipment, and (d) measurements of the final 
product. Items in the first three of these areas were 
intangible measurements, while items in the fourth 
area were tangible. Five of the tests were directed 
toward measuring the ability of naval aviation struc- 
tural mechanics and three were directed toward naval 
aerial photographers. The aviation structural me- 
chanics’ tests included a rigid-tubing assembly test, 
a drill-point grinding test, a metal-working test, an 
aluminum butt-welding test, and a fabric-repair test. 
The tests for aerial photographers included a mo- 
tion-picture processing test, a test on the use of the 
Speed-Graphic camera, and a continuous-strip-print- 
ing test. 

The aviation structural mechanics’ tests were ad- 
ministered to 19 sailors at the Naval Air Station, 
Atlantic City. Two of these examinees were Avia- 
tion Structural Mechanics, first class; three were 
Aviation Structural Mechanics, second class; ten 
were third class; four were “strikers.” Two Chief
Aviation Structural Mechanics, independently but 
simultaneously, administered the tests to each ex- 
aminee. Chief “A” acted as one of the test ad- 
ministrators throughout, and scored all 19 examinees. 
Chiefs “B” through “F” each, simultaneously with 
“A,” administered the battery to from three to five 
of these examinees. Thus each examinee was inde- 
pendently evaluated by two examiners, examiner 
“A” and one other, both of whom simultaneously, 
but independently, scored his work. Precautions 
were taken to insure that the chiefs did not com- 
municate with each other during the scoring, and 
each scoring sheet was collected immediately after 
an examiner was finished with it. Thus, even if a 
chief decided on the basis of post hoc cross com- 
munication that he had erred, he was unable to cor- 
rect his mistake. Moreover, the presence of the re- 
searcher served to enforce “security.” 


For the aerial photographers a similar paradigm 
was used. However, instead of chief petty officers 
as examiners, four aerial photographers were chosen 
at random from the personnel at the photographic 
laboratory of the Naval Air Station at Atlantic City. 
One examiner (W) was constant throughout and 
administered all three photographic tests to a total 
group of 15 examinees. Each of the three other examiners
(X, Y, or Z) independently, but simultaneously
with “W,” evaluated five of the 15 examinees.
All the examinees were Aerial Photographers,
third class. This was the entire complement of
aerial photographers in that rate at the photo- 
graphic laboratory at the Naval Air Station, Atlantic 
City. Since the aerial photographic tests were di- 
rected as an experimental hurdle for the rate of 
Aerial Photographer, second class, we attempted to 
limit our experimental population to those who were 
as typical as possible of those who could eventually 
take the tests. 


Results and Discussion 


An item analysis was first performed com- 
paring Chief “A’s” scoring of each item within 
a test area with the scoring on the identical 
item of the co-administrator who simultane- 
ously, but independently, scored the same ex- 
aminee. Percentage of consistency between 
simultaneous examiners for each test area 
was then obtained. Thus, 


$$
\text{Interobserver consistency within a test area}
= \frac{\text{Number of items in test area scored in same manner by simultaneous observers}}
       {\text{Total number of items in area}} \times 100
$$
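
The consistency index defined above is a simple percentage of agreement, and a minimal sketch of it follows. The function name and the item codings (1 for an item checked as satisfactory, 0 otherwise) are assumptions for illustration; the example scores are hypothetical and are not taken from the check lists used in the study.

```python
def percent_agreement(scores_a, scores_b):
    """Percentage of items in a test area scored identically by two observers.

    scores_a and scores_b hold one entry per check-list item, in the same
    item order, for the same examinee.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both observers must score the same items")
    if not scores_a:
        raise ValueError("a test area must contain at least one item")
    same = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
    return 100.0 * same / len(scores_a)


if __name__ == "__main__":
    # Hypothetical scoring of ten items by two simultaneous examiners.
    examiner_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 1]
    examiner_b = [1, 1, 0, 1, 0, 0, 1, 1, 1, 1]
    print(percent_agreement(examiner_a, examiner_b))
```

Note that the index counts raw agreement only; it makes no correction for agreement that would be expected by chance.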


The percentage of test-area interexaminer 
agreement for the various aviation structural 
mechanics’ tests is presented in Table 1. 

Similarly, for the aerial photographers it 
was possible to compare the agreement of examiner
“W’s” scorings with those of examiners
“X,” “Y,” and “Z.” The results of these
comparisons are presented in Table 2. 

Table 1 does not suggest that the inter- 
examiner consistency in our sample of Chief 
Aviation Structural Mechanics was greater 
for measurements of the final product than 
for measurements in other test areas. For 
two tests, the mean interexaminer consistency 
was greater for all other measurements than 
it was for measurements of the final prod- 
uct. Similarly, for the aerial photographers 


Table 1

Percentage of Interexaminer Agreement for Structural Mechanics’ Tests

[Table body not recoverable. Columns: for each test (Fabrics, Metal Working, Welding, Drill Point Grinding, Rigid Tubing Assembly), Intangible measures (T.U.,* Proc.,** Safety) and Tangible measure (M.F.P.†). Rows: examiners.]

* Care and use of tools.
** Procedure.
† Measurements of the final product.
Arthur I. Siegel 


Table 2

Percentage of Interexaminer Agreement for Aerial Photographers’ Tests

[Column headings as printed: three tests (Mot. Pict. Processing, Speed Graphic, Strip Printing), each with Intangible (Proc.,* Safety) and Tangible (M.F.P.**) measures. Only five data columns survive, and their assignment to these headings is not recoverable.]

Examiners
W and X      95     93     93     82     66
W and Y      94     97     96     90     90
W and Z      96     93     93     87     90
Mean         95     94     94     86     82
σ           .81    1.8    1.4    3.3

* Procedure.
** Measurements of the final product.
(Table 2), no regular superiority was seen in 
interexaminer consistency for measurements 
of the final product. 
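
The summary rows of Table 2 can be checked against the three examiner-pair rows. The sketch below recomputes the column means and standard deviations from the five columns of percentages printed in the table; the use of the population form of the standard deviation (dividing by n) is an assumption, chosen because it comes closest to the printed σ values, and the variable names are illustrative.

```python
import math

# Percentages of agreement as printed in Table 2 (five surviving columns).
TABLE_2_ROWS = {
    "W and X": [95, 93, 93, 82, 66],
    "W and Y": [94, 97, 96, 90, 90],
    "W and Z": [96, 93, 93, 87, 90],
}


def column_stats(rows):
    """Return (means, sds) per column; sds divide by n, not n - 1."""
    columns = list(zip(*rows.values()))
    means = [sum(col) / len(col) for col in columns]
    sds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / len(col))
        for col, m in zip(columns, means)
    ]
    return means, sds


means, sds = column_stats(TABLE_2_ROWS)

if __name__ == "__main__":
    for m, s in zip(means, sds):
        print(f"mean = {m:.1f}  sd = {s:.2f}")
```

Rounded to whole numbers, the computed means agree with the printed Mean row. The computed standard deviations for the first, third, and fourth columns agree with the printed values to the precision given; the second comes out slightly above the printed 1.8.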

The reason for the comparatively high in- 
terexaminer reliabilities seen for the measure- 
ments in the intangible areas may be an out- 
growth of the objectivity introduced into the 
check-list items and of the grossness of the 
observations called for in measuring perform- 
ance in process (intangible products). If we 
want to know if someone is dead or alive, we 
can use a stethoscope, but if the person is 
moving around, most observers will agree 
that he is alive without the use of the measur- 
ing instrument. Similarly, our observations 
of performance in process may have been 
gross enough and well defined enough to pre- 
clude the need for the stethoscope. 


Summary 


Interobserver consistency for measurements 
of intangible and tangible products of per- 
formance for individually administered per- 
formance tests was investigated. The data 
did not suggest superior consistency for tan- 
gible measurements. 


Received August 20, 1954. 
Factor analysis provides one means for
gaining insights into how the raters inter- 
preted the items of a rating scale. Inferences 
can be drawn with regard to the nature of the 
discriminations made and the extent of the 
discriminations. Furthermore, an estimate of 
the magnitude of halo effect in the ratings 
can be obtained. 

The results of a factor analysis can also 
prove useful in the construction of future rat- 
ing scales. “Good” items can be retained and 
new items developed to improve or broaden 
the areas measured by the scale. 


Data Analyzed 


A factor analysis was made of ratings ob- 
tained on 97 division managers. The men 
rated are responsible for administering the 
work of groups, ranging in size from 100 to 
200, of clerical employees. A few of the 
groups have less than 100 employees, and one 
or two, over 200. The ratings were made for 
administrative purposes by the managers’ 
superiors. 

The ratings were made on an assignment- 
type rating scale. The assignment scale is 
similar in form to the conventional graphic 
rating scale. The assignments, however, de- 
scribe job requirements rather than traits or 
personal characteristics of the ratees. 

The assignments were selected to cover the 
more important performance requirements of 
the manager’s job. The 20 assignments mak- 
ing up the scale are presented in Table 1. 

The response categories were the same for 
each assignment. Scale values were assigned 
each response category. The values were ob- 
tained by means of the method of paired 
comparisons (5). There were five response 
categories separated by approximately equal 
intervals. The categories ranged from “I 
would not want him” to “I would be en- 
thusiastic about having him.” 

1 This study was made at The Prudential Insur- 
ance Company of America while the author was a 


member of the Personnel Research Division of the 
Home Office. 


Method 


The first step in analyzing the data was 
that of scoring the response categories. Scores 
ranging from 0 through 4 were assigned. A
score of 0 was assigned to the response cate- 
gory “I would not want him.” A score of 4 
was assigned the category “I would be en- 
thusiastic about having him.” Scores of 1, 
2, and 3 were assigned the intermediate cate- 
gories. 

Correlations between the scored assign- 
ments were then computed with the aid of 
IBM equipment. The method used was one 
developed by Toops (8). 
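
The scoring and correlation steps can be illustrated briefly. The sketch below is not the Toops procedure or the IBM routine referred to above; it substitutes an ordinary product-moment correlation computed directly, and the function names and example ratings are hypothetical. Only the two end categories are named in the text, so categories are identified here by rank (1 through 5) rather than by wording.

```python
import math


def score_category(rank: int) -> int:
    """Convert a response category to its score.

    rank 1 is "I would not want him" (score 0) and rank 5 is
    "I would be enthusiastic about having him" (score 4).
    """
    if not 1 <= rank <= 5:
        raise ValueError("rank must be between 1 and 5")
    return rank - 1


def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(items):
    """items[i] holds the scores of every ratee on assignment i."""
    k = len(items)
    return [[pearson_r(items[i], items[j]) for j in range(k)] for i in range(k)]


if __name__ == "__main__":
    # Hypothetical category ranks for six ratees on three assignments.
    ranks = [[5, 4, 3, 4, 2, 1], [4, 4, 3, 5, 2, 2], [3, 5, 2, 4, 1, 2]]
    scored = [[score_category(r) for r in item] for item in ranks]
    for row in correlation_matrix(scored):
        print([round(r, 2) for r in row])
```

With 20 assignments the same routine would yield a 20 by 20 matrix, of which only the 190 distinct off-diagonal values enter the factor analysis.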

A modification of Thurstone’s group centroid
method (7) was used in analyzing the
correlation matrix. The particular method is
described by the author in a previous study
(3). The resulting factors were rotated to
orthogonal simple structure. The graphical
method of rotating two factors at a time was
used (6, pp. 73-78).

Initial rotations made it evident that sim- 
ple structure within an orthogonal reference 
frame could not be obtained. Consequently, 
an additional factor was introduced. This 
factor was rotated with the initial centroid 
factors as a general factor and made possible
rotation within an orthogonal reference
frame.²

Fiske (2) reports using this, or a similar 
method, in three studies that he carried out. 
He obtained a general factor by setting a 
plane orthogonal to four settled planes. The 
method evidently obtains results similar to 
those obtained by the Burtian Iterative Sum- 
mation method described by Eysenck (1). A 
general factor is extracted first rather than at 
the end of the analysis. 

Once the rotations were completed a transformation
matrix was computed (7) and rotated
loadings obtained. The latter were
then adjusted by a method developed by
Wherry (9) in order to correct for the initially
erroneous estimations of the communalities.
The final loadings were then used in
obtaining residual correlations.

2 The method used was presented in a course on
factor analysis at The Ohio State University by Dr.
Robert J. Wherry.

The average variance for each of the ob- 
tained factors was computed. Finally, the 
factor loadings were inspected and the factors 
were named. 


Results 


The intercorrelations are all positive and
generally high.³ They range from .22 to .82,
with a median of .55. The residual correlations
range from − .08 to + .09, with a
median of − .005. Seventy-nine per cent of
the residuals lie within the range of − .03
and + .03.
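
A residual correlation is what remains of an observed correlation after the part reproduced by the factors has been removed. The sketch below assumes an orthogonal solution, in which the reproduced correlation between two items is the sum of the products of their loadings on each factor; the function names and the loadings are hypothetical and are not values from Table 1.

```python
def reproduced_r(loadings_i, loadings_j):
    """Correlation between items i and j implied by orthogonal factor loadings."""
    return sum(a * b for a, b in zip(loadings_i, loadings_j))


def residual_r(observed_r, loadings_i, loadings_j):
    """Observed correlation minus the correlation reproduced by the factors."""
    return observed_r - reproduced_r(loadings_i, loadings_j)


if __name__ == "__main__":
    # Hypothetical loadings of two items on two orthogonal factors.
    item_i = [0.6, 0.5]
    item_j = [0.5, 0.4]
    print(round(residual_r(0.52, item_i, item_j), 3))
```

Residuals that cluster closely around zero, as reported here, indicate that the extracted factors reproduce the observed intercorrelations well.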

The final rotated and corrected factor 
loadings are presented in Table 1. Both the 
assignments and the factors have been re- 
arranged to facilitate reading the table. 
Wherever possible, the assignments having 
the highest loadings on a particular factor 
have been grouped together, and the factors 
have been ordered according to their relative 
importance. The average variance for each 
factor is also presented in Table 1. 

One general factor and five group factors 
were obtained. The communalities range 
from .63 (Assignments 4 and 10) to .88 (As- 
signment 14). On the average 73% of the 
variance of each of the assignments is ac- 
counted for by the six factors. The general 
factor accounts for 31% while the group fac- 
tors account for 42%. 
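
The variance figures above follow from the squared loadings. In an orthogonal solution the communality of an item is the sum of its squared loadings, and the average variance accounted for by a factor is the mean of the squared loadings on that factor, so the factor shares add up to the mean communality (here 31% + 42% = 73%). The sketch below illustrates the arithmetic with hypothetical loadings; the function names are illustrative.

```python
def communality(loadings):
    """Sum of squared loadings of one item on all orthogonal factors."""
    return sum(a * a for a in loadings)


def average_variance_by_factor(loading_matrix):
    """Mean squared loading per factor; rows are items, columns are factors."""
    n_items = len(loading_matrix)
    n_factors = len(loading_matrix[0])
    return [
        sum(row[k] ** 2 for row in loading_matrix) / n_items
        for k in range(n_factors)
    ]


if __name__ == "__main__":
    # Hypothetical loadings: column 0 is a general factor, columns 1-2 are group factors.
    loadings = [[0.6, 0.5, 0.3], [0.5, 0.6, 0.2]]
    shares = average_variance_by_factor(loadings)
    mean_h2 = sum(communality(row) for row in loadings) / len(loadings)
    print([round(s, 3) for s in shares], round(mean_h2, 3))
```

The share of the first column corresponds to the general factor, and the sum of the remaining columns to the group factors taken together.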

The loadings on Factor I, the general fac- 
tor, vary from .33 (Assignment 18) to .71 
(Assignment 5). The average loading is ap- 
proximately .56. Factor I may be described 
as representing the extent of “halo effect” in 
the ratings. 

Factor II is a very well determined factor.
Eight of the assignments have loadings of .50
or greater. Seven of the eight involve directly
dealing with others (peers, superiors,
or subordinates). Because of this fact and
the particularly high loading on Assignment
18, Factor II was named “skill in dealing with
others.”

3 The correlation matrix has been deposited with
the American Documentation Institute. Order Document
No. 4593 from ADI Auxiliary Publications
Project, Photoduplication Service, Library of Congress,
Washington 25, D. C., remitting in advance
$1.25 for microfilm or $1.25 for photocopies. Make
checks payable to Chief, Photoduplication Service,
Library of Congress.

The remaining factors proved more difficult 
to name than Factor II. They are not as
well determined, each having relatively few 
“high” loadings, nor is the nature of each fac- 
tor so apparent from inspection of the assign- 
ments having the highest loadings. They 
were tentatively named as follows: Factor III, 
“judgment”; Factor IV, “effectiveness in su- 
pervising the work”; Factor V, “effectiveness 
in planning the work”; Factor VI, “effective- 
ness in improving operating efficiency.” 


Discussion 


Though the general factor accounts on the
average for more of the variance of the rating
items than do any of the group factors,
the group factors taken together account for
more of the item variance than does the general
factor. Thus, though the effect of halo
is sizable, the raters did make rather sharp
discriminations among the assignments. This
finding suggests that when a rating scale is
properly designed and administered, even
though for administrative rather than for experimental
purposes, the raters can and do
discriminate among the items. Factor analysis
thus becomes a useful tool in determining
what discriminations were made.

It should be pointed out that the estimate 
of the magnitude of halo effect in the ratings 
is to some extent determined by the way in 
which the factors are rotated. In this study 
a deliberate attempt was made to eliminate 
the negative loadings. The rotations neces- 
sary to accomplish this goal tended to reduce 
the magnitude of the loadings on the general 
factor. 

It was pointed out that with the possible 
exception of Factor II the names of the group 
factors are offered as tentative hypotheses. 
If the same population were to be rated again, 
these hypotheses could serve as a basis for 
constructing additional items that would serve 
to verify or refute the initial hypotheses. 
Thus an assignment used in a previous analysis
(4), namely, “To a job requiring ability
to analyze a situation and to weigh the facts 
in order to arrive at a logical answer,” would 
serve to test the hypothesis that Factor III 
reflects evaluations of the ratee’s “judgment.” 

The finding that raters can make discriminations
among various aspects of work performance
suggests the possibility that rating
scales could be useful as a means for determining
strengths and weaknesses and not
merely as a means for obtaining over-all
evaluations of work performance. As such
they could aid materially in the design of
training programs (to correct for rated deficiencies)
and in the administration of executive
development programs. They could
also assist in placing administrative personnel
in organizations where their demonstrated
strengths would make them most effective.

It must be recognized, of course, that individual
rating items make a contribution beyond
that reflected in the results of a factor
analysis. Without information regarding the
reliabilities of the individual items, however,
there is no way of assessing their specific contributions
to the total item variances. Because
the communalities of the items are generally
high, it is doubtful whether in this
instance the specific variance of any of the
assignments would be very great.


Table 1 

Factor Loadings 

Factor 

Assignment I III IV 

1. To a job in which tact and ability to get along with people 
is the most important qualification for success. 33 05 05 

2. To work with a group of managers from divisions closely 
related to his in order to find the most satisfactory solution 
to all of any kind of mutual problem, such as the resched- 
uling of work, the transfer of personnel between divisions, 
or the installation of new methods. 

3. To meet with top management and present to them an accu- 
rate picture of employees’ attitudes and viewpoints. 

4. To manage a division in which changes are occurring in 
organization, scheduling, and work methods, thus requiring 
the ability to adjust to change without resistance, personal 
bias, or emotional upset. 

5. To manage a division which requires constant and close 
coordination of its activities with those of other divisions. 

6. To take charge of a division where morale is low in order to 
discover and correct the source of discontent. 

7. To introduce a new work procedure in a division in such a 
way as to gain the full cooperation of the employees. 

8. To meet with general managers to represent your point of 
view and to make decisions in your name. 

9. To write a letter for your signature answering a complaint 
from a policyholder which resulted from work done in his 
division. 

10. To a job where for success he will have to acquire a broad 
knowledge not only of his own job but also of all phases of 
the business related to his job. 

11. To a job similar to his present job except that he would be 
required to act on his own responsibility for several months 
at a time in the absence of his immediate superior. 

12. To supervise a complicated clerical operation requiring con- 
sistent application of specific detailed rules and procedures. 

13. To explain clearly to employees a complicated and impor- 
tant clerical operation. 

14. To train a large number of clerks to perform new opera- 
tions. 

15. To study the operations of a division for the purpose of 
reorganizing the flow of work and work methods in order 
to bring about more effective handling of business. 

16. To manage a division in which work fluctuates widely in 
volume, thus requiring extensive advance planning for best 
utilization of personnel during slack and busy periods. 

17. To establish the organization structure and staffing require- 
ments for a new division of the company. 

18. To take charge of a division for the particular purpose of 
eliminating protracted overtime. 

19. To take charge of a division for the particular purpose of 
meeting new higher production standards which have been 
introduced. 11 00 —02 15 AO 

20. To take charge of a division for the particular purpose of 
eliminating excessive errors or delays in handling the work. 00 10 23 13 37 72 

Average Variance 17 07 07 06 05 73 


Summary 


A factor analysis of 20 assignment rating 
items used in rating the division managers of 
a large insurance company is reported. A 
general factor and five group factors were ob- 
tained. 

Though on the average the general factor 
accounts for more of the accounted-for vari- 
ance of the rating items than does any single 
group factor, the group factors together ac- 
count for more of the average total accounted- 
for variance than does the general factor. 
The group factors are named and the impli- 
cations of the results for personnel manage- 
ment are discussed. 


Received December 7, 1954. 
Early publication. 

sources. J. abnorm. soc. Psychol., 1949, 44, 


One of the major services offered by psy- 
chological consulting firms to business and in- 
dustrial organizations has been that of per- 
sonnel evaluation. Many organizations feel 
that they can plan the work assignments and 
future development of their sales and mana- 
gerial personnel much more readily if they 
have a clear picture of the psychological 
make-up of the individuals involved. Because 
management must often make fast decisions, 
because too few personnel are involved, or 
because the selection problem may be non- 
recurring in nature, it is often impossible to 
base personnel selection on a procedure 
backed up by sound research findings. In 
such instances, consulting firms typically base 
their selection procedures on an “educated 
guess.” The guess is “educated” by a com- 
bination of specific psychological measures 
selected with knowledge of the position to be 
filled and knowledge of the general validity 
of various selection techniques. 

This method of personnel evaluation has 
survived and, in fact, prospered on the basis 
of “faith validity.” While, in most cases, an 
effort is made to improve the assessment tech- 
nique in the light of information and experi- 
ence gained through increased contact with 
specific job situations, consulting firms have 
not—in the past—applied research methods 
to validate these professional services. 

The Personnel Audit Program of the Per- 
sonnel Research Institute offers to business 
and industry a personnel evaluation service 
based on a professional, nonresearch approach. 
Although the specific techniques of personnel 
assessment have changed from time to time 
within this organization, the usual procedure 
has been to administer to each applicant a 
battery of standardized paper-and-pencil tests 
selected to measure factors felt to be essen- 
tial to success in the specific position under 
consideration. In addition, each applicant is 
interviewed simultaneously by two psycholo- 
gists who probe to uncover factors related to 
job success and general personal and social 
adjustment. In recent years, projective tech- 
niques have been incorporated in all Person- 
nel Audits. Results from the various phases 
of each audit are integrated by a psycholo- 
gist and presented in a final report to the 
management of the client company. Predic- 
tions as to job success and recommendations 
as to future development of the employee are 
included. 

1 Now at Personnel Research Section, AGO. 

During the ten-year period in which the 
Audit Program has been in operation, the 
case load has increased from 10 cases in the 
first year to 279 in the past year. This rapid 
and continuous rise in cases indicates the 
confidence placed in this service by the client 
companies. Also, many individual users have 
reported that Audit reports were verified by 
their experience with the auditees. Although 
there exists a faith that the personnel assess- 
ments made through the Audit Program con- 
tribute significantly to the personnel decisions 
made by the client companies, this faith has 
not previously had systematic research as its 
basis. The present study is the first step in 
a research program designed to estimate the 
validity of Personnel Audits. 


Purpose 


The specific objectives of the present study 
fall into two groups. First, there are those 
objectives which constitute an operational at- 
tempt to estimate the validity of the assess- 
ments and predictions made in the Audit 
Program. These are: (a) to evaluate the 
Personnel Audit predictions of over-all job 
performance, (b) to evaluate the Personnel 
Audit predictions of performance in several 
specific areas, (c) to determine directly the 
validity of certain paper-and-pencil tests used 
in the Audit Program as a partial basis for 
the above predictions. 

A second group of objectives is more gen- 
eral perhaps, and derives from the pioneering 
aspects of this study. In some methodological 
phases, this is a pilot study. First, it is un- 
usual in the heterogeneity of the sample so 
far as job assignment is concerned. Included 
in the study are men who fill a wide variety 
of positions in a large number of companies. 

The criterion instrument devised for this 
study had to be general enough to cover this 
range of jobs. At the same time, it was de- 
vised to obtain more than a simple over-all 
rating of job performance. One purpose of 
the study, then, is to see how well specific job 
areas can be evaluated independently by this 
type of scale under these conditions. 

A second general problem with which this 
study is concerned is that of reducing descrip- 
tive verbal material to summary quantitative 
form so as to facilitate statistical analysis. 
The present research provides a trial run of 
one simple and inexpensive method for ac- 
complishing this end. 


Procedure 
The Sample 


All men audited between January 1, 1951 and De- 
cember 31, 1952 were included in this study if they 
were employed by a client company sufficiently long 
to permit evaluation of their performance. Some 
cases were eliminated because complete criterion rat- 
ings were not provided by the company. 

Adequate ratings were obtained for 106 men and 
1 woman. To simplify the computations the N was 
reduced to 100 by discarding seven cases selected at 
random prior to any analysis. These 100 auditees 
constitute the sample and were employed by 18 dif- 
ferent companies with one company employing 33 
and 13 companies employing only one each. 

The auditees filled positions which may be roughly 
classified as follows: 


1. Eleven perform engineering duties primarily. 

2. Seven perform accounting duties primarily. 

3. Four perform general office duties not including 
supervisory functions. 

4. Thirty are in general supervisory positions. 

5. An additional 21 supervise salesmen. 

6. Twelve hold sales-engineering positions that re- 
quire the ability to apply technical knowledge 
to customers’ problems. 

7. Eight are in sales positions requiring no tech- 
nical skills. 

8. Four are sales trainees. 

9. Three hold miscellaneous positions difficult to 
classify. 


The Predictors 


Since no research was planned for the Audit Pro- 
gram in the early years, no quantification of the as- 
sessments made in those years was available. To 
quantify the test and interview material and the de- 
scriptive verbal reports made to the client companies, 
a rating plan was adopted. Two psychologists read 
each audit file consisting of test results, interview 
notes, and the final report. They then made inde- 
pendent ratings for each case. The reliability be- 
tween raters was computed (an intraclass correla- 
tion) and corrected by the Spearman-Brown for- 
mula; the corrected coefficients are presented in 
Table 2, and the sums of these ratings were used as 
the predictors. 
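
As a minimal sketch of the correction described above, the Spearman-Brown formula steps the reliability of a single rater, r, up to the reliability of the sum of k raters: k r / (1 + (k - 1) r). The function name and the single-rater value used here are illustrative assumptions and are not figures from Table 2.

```python
def spearman_brown(r_single: float, k: int = 2) -> float:
    """Reliability of the sum (or mean) of k parallel ratings,
    given the reliability r_single of one rating."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * r_single / (1 + (k - 1) * r_single)


# Hypothetical single-rater intraclass correlation (not taken from Table 2).
r_one_rater = 0.45

# Two psychologists rated each file, so the sum of ratings uses k = 2.
r_sum_of_two = spearman_brown(r_one_rater, k=2)
print(round(r_sum_of_two, 3))
```

Because the predictors were sums of two ratings, the stepped-up coefficient rather than the single-rater coefficient is the relevant estimate of their reliability.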

For each case five rating scales were executed. 
Each scale consists of a brief description of the con- 
tinuum plus descriptions of each of five discrete 
steps on the continuum. The rater checks the ap- 
plicable step. The results indicate that these scales 
are not models to be emulated; so, for the sake of 
brevity, they are not reproduced here. They can be 
identified thus: 


Scale P1—Sociability (how well the man works 
with and is liked by others). 

Scale P2—Organizational Ability (ability to plan 
and organize his work). 

Scale P3—Drive (degree of “push” and “stick-to- 
it-iveness”). 

Scale P4—Over-all performance (of present job). 

Scale P5—Potential (for advancement in the nor- 
mal organizational hierarchy). 


In addition to the rating scales, those tests that 
had been given to 47 or more men were validated 
directly. Of the tests used, only the PRI Classifica- 
tion Test and PRI Tabulation Test are likely to be 
unfamiliar. The former is a short measure of gen- 
eral mental ability; the latter is a clerical test re- 
quiring speedy arithmetic computation and _ tabula- 
tion. 


The Criteria 


The same five scales used as predictors were used 
to obtain criterion ratings. Each man was rated on 
the basis of his job performance by one or more of 
his closest supervisors. The reliability between cri- 
terion ratings was computed for the 72 cases where 
two or more ratings were available. Three ratings 
were available for 25 cases. Random pairs were 
drawn to compute the interrater agreement coeffi- 
cient (an intraclass correlation). The coefficients, 
corrected by the Spearman-Brown formula, are pre- 
sented in Table 2. Where more than one rating 
was available, the mean of these several ratings was 
taken. For each man, then, there resulted one aver- 
age rating on each of the five scales representing 
his supervisor’s judgments about his present over-all 
performance, his potential, and his performance in 
three specific areas. 
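
The criterion procedure above can be sketched roughly as follows. The text does not say which intraclass formula was used, so the one-way analysis-of-variance form for pairs is an assumption here, and the ratings, names, and random seed are hypothetical.

```python
import random
from statistics import mean


def intraclass_r(pairs):
    """One-way ANOVA intraclass correlation for n cases rated twice (k = 2)."""
    k = 2
    n = len(pairs)
    case_means = [mean(p) for p in pairs]
    grand_mean = mean(case_means)
    ms_between = k * sum((m - grand_mean) ** 2 for m in case_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for p, m in zip(pairs, case_means) for x in p
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical supervisor ratings on one scale; some men have three raters.
ratings_by_man = [[4, 4], [3, 4, 3], [5, 4], [2, 3, 2], [4, 5], [1, 2]]

# Draw a random pair wherever more than two ratings exist.
rng = random.Random(0)
pairs = [tuple(rng.sample(r, 2)) for r in ratings_by_man]
print(round(intraclass_r(pairs), 3))

# The criterion score for each man is the mean of all his available ratings.
criterion_scores = [mean(r) for r in ratings_by_man]
```

The agreement coefficient is computed from pairs only, while the criterion score itself uses every rating available for a man.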


Analysis and Results 


The analysis was conducted in two phases. 
The first phase was concerned with the rela- 
tionships among predictive and criterion rat- 
ings; the second phase was an investigation 
of the direct validities of the psychometric 
tests. 

The specific problems involved in the first 
phase were: (a) to determine the validities of 
each of the five predictive ratings in respect 
to the corresponding criterion rating (such as 
relationship of P1 and C1, P2 and C2, etc.), 
(b) to investigate the nature of the predictor 
and criterion variance by factorial analysis, 
(c) to examine “inappropriate” relationships 
(such as P1 and C2, P1 and C3, etc.) in com- 
parison with the “appropriate” relationships 
specified in (a) above. 

The analysis began with the computation 
of the entire product-moment intercorrela- 
tion matrix for predictors and criteria (see 
Table 1). The appropriate validities (itali- 
cized in Table 1) range from .21 for Scale 
1, Sociability, to .38 for Scale 5, Potential for 
Advancement. At the 5% level the lower 
confidence limit exceeds zero in every instance 
and the upper confidence limits range as high 
as .55. Examination of the off-diagonal cor- 
relations in the upper right-hand quadrant of 
Table 1 reveals that with the exception of 
Scale 5, two or more inappropriate coefficients 
in the adjoining rows and columns exceed the 
appropriate values in the diagonal. 
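
The text does not state how the confidence limits were obtained. A common approach, assumed here purely for illustration, is Fisher's z transformation with N = 100. Under that assumption the lower limit for the smallest validity (.21) falls just above zero, which agrees with the statement above; the upper limits need not match the reported .55 exactly.

```python
import math


def r_confidence_limits(r, n, z_crit=1.96):
    """Approximate two-sided 5% confidence limits for a correlation,
    using Fisher's z transformation (an assumed method)."""
    z = math.atanh(r)
    standard_error = 1.0 / math.sqrt(n - 3)
    lower = math.tanh(z - z_crit * standard_error)
    upper = math.tanh(z + z_crit * standard_error)
    return lower, upper


# Smallest and largest appropriate validities reported in the text, N = 100.
for r in (0.21, 0.38):
    lower, upper = r_confidence_limits(r, 100)
    print(r, round(lower, 3), round(upper, 3))
```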

Thus, while Table 1 seems to reflect over- 
all validity of the five ratings, the pattern 
does not clearly indicate specific or differ- 
ential validity for the individual predictors. 
The next step in the analysis, therefore, was 
a multiple-group factor analysis. 

The criteria and the predictors were used 
as separate groups, and two orthogonal fac- 
tors were extracted. These factors revealed 
strong individual halos on the part of the cri- 
terion and predictor raters, but there was also 
a definite tendency for the variables to over- 
lap. The correlation between the group cen- 
troids was .43. The correlation of sums of 
standard scores between criterion and pre- 
dictor ratings was computed to be .37. In 
order to reflect this over-all validity in the 
factor structure and to examine it apart from 


Table 1 


Intercorrelations, Residuals, Means (X̄), and Standard Deviations (σ) of Criterion (C) 
and Predictor (P) Ratings * 








Ratings Cy Ce C; 


Cs P; P2 P; Py 





39 


C1—Sociability 
C2—Organizational Ability 
C3—Drive 

C4—Job Success 
C5—Advancement Potential 


—04 68 
01 — 
05 —O1 

—03 —02 


—01 
02 


—07 
—03 
05 —04 O4 
—03 -01 -—@ 
00 01 01 


X̄ 3.69 
σ 0.85 


P1—Sociability 
P2—Organizational Ability 
P3—Drive 

P4—Job Success 
P5—Advancement Potential 


—02 


3.52 
0.91 


3.84 
0.87 


C, 
44 51 21 10 15 14 
80 76 25 28 17 35 
65 68 16 22 27 
— 73 23 22 20 

—05 30 22 10 34 


—01 
00 
03 

—02 

—01 


30 21 40 
_ 53 65 
06 — 54 
—07 —02 
02 —Ol 


03 
00 
07 
—02 


—09 
00 
04 


06 


3.73 
0.81 


3.52 
1.06 


3.26 
0.63 


3.50 
0.69 


3.52 
0.60 


3.26 
0.77 


2.82 
0.96 





* Decimal points have been omitted from intercorrelations and residuals; residuals are below principal diagonal. 


Table 2 


Factor Loadings (G, C, P), Communalities (h²), Corrected Reliabilities (rir), and Specifics (S) of 
Criterion (C) and Predictor (P) Ratings * 








Ratings G 


Cc 4 





C1—Sociability 32 
C2—Organizational Ability 57 
C3—Drive 51 
C4—Job Success 54 
C5—Advancement Potential 58 


P1—Sociability 28 
P2—Organizational Ability 47 
P3—Drive 37 
P4—Job Success 58 
P5—Advancement Potential 52 


—03 
66 02 
01 

67 
69 —02 


14 23 
—03 62 
—02 48 

02 72 

07 61 





* Decimal points omitted. 
** Interrater reliability less than the communality. 


the individual halo effects, a general factor 
was introduced.² 

Table 2 presents the results of the factor 
analysis along with the corrected interrater 
reliability coefficients and estimates of the 
specific variances for each rating. As the 
asterisks indicate, the reliabilities were fre- 
quently less than the communalities, and 
zeros were entered rather than negative quan- 
tities. This was done because the reliabilities 
are known to be underestimates. The inter- 
rater coefficients express only the amount of 
reliable variance common to the raters—the 
agreed-upon variance in the sum of ratings. 
Any reliable or systematic variance specific 
to the individual raters is neglected by the 


2 Since this process of introducing a general factor 
is not widely known, a brief explanation is in order. 
An additional factor may be introduced into any 
factor structure simply by adding an additional fac- 
tor column with zero loadings. By rotation with 
the extracted factors, the added factor may be made 
to absorb factor variance from them. The purpose 
in adding an additional factor in this way is to ab- 
sorb common variance shared by two or more other- 
wise different factors to permit their presentation in 
a simple structure. This accomplishes the same pur- 
pose as using so-called oblique factors, and the same 
general end is achieved. The main difference is that 
the orthogonality of the factors is maintained, and 
the fact that correlation exists between the oblique 
vectors is not confused by the presentation of second- 
or higher-order factors. Unlike an oblique rotational 
solution, this method makes the second-order factor 
an explicit general factor. The variables which con- 
tribute most to general factors may readily be deter- 
mined and the correlation matrix is accounted for by 
the factors as they are presented and interpreted. 


corrected interrater coefficient. Theoretically 
it is impossible for the communalities to ex- 
ceed the reliabilities unless the reliabilities 
are underestimated. Therefore, the specifics 
of Table 2 must be considered underestimates. 
Each rating probably involves some specific 
variance even though zero is indicated. Just 
how much is unknown. 
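
The handling of specifics described above amounts to subtracting each communality from the corresponding corrected reliability and entering zero when the difference is negative. The sketch below uses hypothetical values, not entries from Table 2, and the function name is illustrative.

```python
def specific_variance(reliability: float, communality: float) -> float:
    """Estimated specific variance: reliable variance not shared with the
    common factors, entered as zero when the estimate would be negative."""
    return max(0.0, reliability - communality)


# Hypothetical (reliability, communality) pairs for illustration only.
examples = {
    "rating with a definite specific": (0.70, 0.55),
    "reliability below communality": (0.50, 0.62),
}

for label, (reliability, communality) in examples.items():
    print(label, round(specific_variance(reliability, communality), 2))
```

Because the interrater reliabilities are underestimates, every value produced this way is a lower bound on the true specific variance.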

The importance of the specifics lies in the 
fact that both predictor and criterion ratings 
of sociability yield very definite specifics in 
spite of the underestimation. The relatively 
low loadings of the sociability ratings on the 
respective halo factors and the general va- 
lidity factor together with the marked spe- 
cifics suggest that this is one rating area 
which both predictive raters and criterion 
raters were able to differentiate to some de- 
gree. The residual table further suggests 
that a small doublet factor based on this rat- 
ing area might be obtained. This factor 
would, however, have been too small for con- 
fident interpretation. Its existence is properly 
left as an area for future research. 

Factor G is interpreted as a general validity 
factor. It represents the tendency for the 
predictive ratings to agree with the criterion 
ratings in an over-all fashion. Factors P and 
C represent the differential rating biases of 
the predictor and criterion raters respectively. 
They are the anticipated criterion and pre- 
dictor rating halos. 
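
The device described in footnote 2 can be illustrated with a small sketch: a column of zero loadings is added to the extracted structure and then rotated orthogonally against the existing factors, so that it absorbs variance from them while the reproduced correlations and communalities stay unchanged. The loadings and rotation angles below are hypothetical and are not the values used in the study.

```python
import math


def reproduced_correlations(loadings):
    """Correlations implied by orthogonal factors: the product L times L-transpose."""
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) for row_j in loadings]
        for row_i in loadings
    ]


def rotate(loadings, p, q, degrees):
    """Orthogonal rotation of factor columns p and q through the given angle."""
    c = math.cos(math.radians(degrees))
    s = math.sin(math.radians(degrees))
    rotated = []
    for row in loadings:
        new_row = list(row)
        new_row[p] = c * row[p] + s * row[q]
        new_row[q] = -s * row[p] + c * row[q]
        rotated.append(new_row)
    return rotated


# Hypothetical loadings of four ratings on two extracted halo factors (C, P).
extracted = [[0.8, 0.3], [0.7, 0.2], [0.3, 0.7], [0.2, 0.8]]

# Step 1: add a general-factor column G with zero loadings (column 0).
with_general = [[0.0] + row for row in extracted]

# Step 2: rotate G against each halo factor so that it absorbs shared variance.
structure = rotate(with_general, 0, 1, 35)
structure = rotate(structure, 0, 2, 35)

for row in structure:
    print([round(x, 2) for x in row])
```

Since each step is an orthogonal rotation, the three-factor structure accounts for the correlation matrix exactly as well as the original two-factor structure; only the distribution of variance among the columns changes.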

Comparison of the factor loadings for pre- 
dictors and criteria on the general and the 
separate halo factors shows that the predictor 
loadings are much more heterogeneous than 
the criterion loadings. This suggests that the 
halo effect is much less within the predictors 
than within the criteria. Certainly it would 
be pleasantly encouraging to believe that psy- 
chologists were less susceptible to this effect 
than untrained industrial raters. Unfortu- 
nately, examination of the reliabilities reveals 
an almost identical parallel fluctuation. The 
apparent differences in strength of halo could 
well be due to the differences in reliability. 
Given relatively equal reliabilities or a simi- 
lar pattern of reliabilities in criteria and pre- 
dictors, similar loadings on both the general 
and halo factors could be expected. This 
possibility also makes it unwise in the present 
research to draw conclusions regarding the 
relative merits of the several predictor rat- 
ings in terms of their loadings on factor G. 
In the second part of the analysis, the 
validities of the objective tests used in the 
audits were computed against the five cri- 
terion ratings. Raw scores were obtained for 
the PRI tests, the Cardall, How Supervise, 
and the Allport-Vernon. Median T scores 
were obtained for each of the areas of the 
Strong, and C scores were obtained from the 
Guilford-Zimmerman. Product-moment cor- 
relations, 5% confidence limits, means, and 
standard deviations for the tests are pre- 
sented in Table 3. Means and standard 
deviations for the criterion ratings differed 
very little from group to group. Table 1 
provides the best estimates of these statistics. 


Discussion and Conclusions 


Interpreting the practical significance of 
validity coefficients is seldom a simple task; 
in the present case it is more complicated 
than usual. In addition to the usual prob- 
lems concerning the reliability and ultimate 
validity of the criterion, it should be noted 
that the criterion ratings used in this study 
were obtained by mail from many different 
companies. The only control exerted by the 
researchers was in the form of written in- 
structions. It is to be expected that a more 
carefully obtained criterion might lead to 
higher validities. 

On the other hand, the possibility of cri- 
terion contamination cannot be overlooked. 
It was not practical to eliminate raters who 
had either seen the original report or been 
influenced by those who had seen it. How- 
ever, even where this was the case, a mini- 
mum of six months (and in some cases as 
long as two years) had elapsed between the 
time the report had been submitted and the 
evaluations rendered. Raters were specifically 
requested not to refer back to the report in 
making their ratings. 

When interpreting the test validities, it 
should be borne in mind that the sample is 
highly preselected. First, because of the na- 
ture of the jobs for which audits are finan- 
cially practical, auditees are mostly of above 
average ability and personality adjustment. 
Many have long experience in their vocational 
field, and natural selection has an effect. 
Further, the client companies carefully pre- 
screen all applicants and only promising ones 
are audited. This point does not bear on the 
interpretation of the efficiency of the over-all 
assessment technique because the technique 
is designed primarily for such samples, but 
the general efficiency of the standard tests is 
probably underestimated by this sample. 

In spite of this sharp curtailment of range, 
the magnitude of many of the coefficients 
indicates that the tests are useful in such 
populations. Aside from the magnitude of 
the coefficients, the encouraging fact evident 
in the table is that the pattern of relationships 
is almost entirely as would be predicted from 
general findings and interpretations of the 
various traits tested. For example, general 
mental ability as measured by the PRI Clas- 
sification Test correlates most highly with 
Scale P5 (Potential) and Scale P2 (Organi- 
zational Ability), next most with Scale P4 
(Over-all Performance), and least with Scale 
P3 (Drive) and Scale P1 (Sociability). The 
coefficients are too numerous to discuss in de- 
tail here, but the reader may determine from 
Table 3 which of the tests have usable va- 
lidity and which have the validity pattern 
expected from a traditional interpretation. 
There are some exceptions, of course, but 
these findings support in general the inter- 
pretations suggested by the respective test 
manuals. 

If one thinks backward from test to cri- 
teria, these results suggest that the five cri- 
terion ratings are actually composed in part 
of different psychological factors, although 
the factor analysis does not clearly show this. 
On the basis of the factor analysis of the cri- 
terion ratings one would expect any validity 
obtained on any one of the last four ratings 
to be approximately duplicated on the other 
three. When this does not occur it may be 
because of chance variation or the fact that 
a particular test is valid against the specific 
variance of a given rating. The sociability 
rating has sufficient specific variance over and 
above the common factor variance to suggest 
that it will provide the greatest differences in 
correlational pattern, but other ratings also 
show varying patterns of correlation against 
the tests. 

In leaving the reader to interpret Table 3 
for himself, perhaps one word of caution will 
be helpful. Since the criterion was not well 
understood, and since the population was 
heterogeneous with respect to job classifica- 
tion, no specific predictions of the patterns 
of test validity were undertaken. In the ab- 
sence of such predictions it is very easy to 
capitalize on chance by perceiving neat ad 
hoc explanations or finding confirmation for 
hypotheses which suddenly become very fa- 
miliar and “have been known all along.” 
This danger is especially great when a large 
number of coefficients have been calculated. 
Not only will some individual coefficients be 
large by chance, but even some patterns of 
coefficients will appear by chance to fit a 
predicted pattern if there are enough obtained 
patterns. 

In Table 3 there are 38 correlations whose 
lower confidence limit exceeds .00; this is, of 
course, about five times as many as would be 
expected by chance at the 5% level. It is 
impossible, however, to say which of these 
indicate true relationships and which are 
chance occurrences. It is very important, 
therefore, to exercise caution in interpreting 
the data of Table 3. With such caution and 
an awareness of the necessity for cross-vali- 
dation, however, the data of Table 3 may be 
used to improve the validity of the test bat- 
tery. (At PRI these data led to the reintro- 
duction of the Allport-Vernon and the Guil- 
ford-Zimmerman into most audits.) 
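
The chance argument above can be made concrete with a little arithmetic. Table 3 is not reproduced in this section, so the total number of coefficients is not known here; the sketch only works backward from the two figures given in the text (38 correlations, about five times chance) and treats the 5% level as the chance rate, as the text does. The function names are illustrative.

```python
def expected_by_chance(n_coefficients: float, alpha: float = 0.05) -> float:
    """Number of coefficients expected to pass the test by chance alone."""
    return alpha * n_coefficients


def implied_number_of_coefficients(observed: int, ratio_to_chance: float,
                                   alpha: float = 0.05) -> float:
    """Total number of coefficients implied by an observed count that is
    ratio_to_chance times the chance expectation."""
    return observed / (ratio_to_chance * alpha)


# Figures stated in the text: 38 coefficients, about five times chance.
n_implied = implied_number_of_coefficients(38, 5)
print(n_implied)
print(expected_by_chance(n_implied))
```

On these assumptions roughly 150 coefficients were computed, of which seven or eight would be expected to clear the 5% level by chance alone, which is why individual coefficients cannot be singled out as real without cross-validation.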

When interpreting the coefficients of Table 
1 as estimates of the efficiency of the over-all 
audit procedure, it should be remembered 
that the research predictors were not identi- 
cal with the operational predictors. That is, 
while selection was based on a descriptive re- 
port made by the psychologists conducting 
the assessment, other psychologists reduced 
the report to quantitative predictors. Un- 
doubtedly some error was introduced in this 
step, and again in interpreting the obtained 
results we expect that the practical validity 
would be higher. 

Because the sample here is extremely het- 
erogeneous in regard to job duties, any inter- 
pretation of the efficacy of the assessment 
technique for a specific job is somewhat tenu- 
ous. As further research is conducted, how- 
ever, the general validity will become better 
established and such specific interpretations 
can be made with greater confidence. 

In light of these limitations, the investi- 
gators feel that pending further research no 
attempt should be made to interpret precisely 
the efficiency of the assessment technique. 
Rather, these general conclusions are drawn: 


1. Compared with most validity findings 
these results are promising and indicate that 
the technique investigated has practical value 
and is definitely worthy of further research. 

2. When the research is more rigorously 
conducted, the resulting estimates of validity 
are likely to be higher. 


In recent years the job satisfaction survey 
has become an increasingly popular manage- 
ment tool. It would seem reasonable to sup- 
pose that underlying the use of such surveys 
is the implicit assumption that high job satis- 
faction is in some way related to some aspect 
of corporate success. 

The recent literature would indicate that 
job satisfaction and certain objective per- 
formance measures are related (1, 4, 5, 8), 
although there is occasional research evidence 
(3) and theoretical discussion (6) to indi- 
cate that this is not necessarily so. How- 
ever, granting that a relationship between job 
satisfaction and performance does exist, when 
a company undertakes a morale audit it al- 
most invariably asks its employees a large 
number of quite specific questions and at- 
tempts to remedy the problem situations that 
these items uncover. Since several problem 
areas usually have a way of turning up, man- 
agement is faced with the task of assigning a 
priority among them. 

The present study was undertaken to deter- 
mine the nature of the relationship, if any, 
that exists between satisfaction or dissatisfac- 
tion with specific job aspects and perform- 
ance among life insurance agents. If it is 
found that specific items are differentially 
predictive of a relevant criterion, and that 
the percentage of dissatisfaction is not per- 
fectly related to the predictive ability of the 
items, then priority of action need not be es- 
tablished on the basis of frequency of dis- 
satisfaction alone, as is frequently the case, 
since the degree of relationship between the 
item and the criteria can be of help in mak- 
ing the necessary decisions. 


Method 


Initially 15 life insurance companies, ranging in 
size from very large to relatively small, participated 
in this study. These companies provided LIAMA 
with listings of all their agencies containing five or 
more full-time agents. The managers of these agen- 
cies were asked to provide the names and addresses 
of their agents. 

1 This study developed from discussions held with 
the authors and Charles A. Waters, Leonard Fergu- 
son, and S. Rains Wallace, Jr. 


Obtaining a Pool of Items 


In order to obtain a reasonable and inclusive pool 
of items, a preliminary survey of agents’ attitudes 
was conducted. Three agencies were chosen at ran- 
dom from each of the participating companies. The 
agents were then mailed a questionnaire on which 
they were asked to list, as completely as possible, 
everything in their jobs which they considered to 
be major sources of satisfaction and dissatisfaction. 
Two additional questions were asked to obtain list- 
ings of those things that they believed were sources 
of satisfaction and dissatisfaction for other agents. 

In this phase of the study 495 questionnaires were 
mailed out and 108 were returned in a usable con- 
dition. While this number is small, it is believed 
that the responses provided a comprehensive listing 
of the things that satisfy and dissatisfy life insurance 
agents. Several depth interviews conducted with an 
additional sample of agents provided little that had 
not been obtained from the mail questionnaires. 

The “projective” items were included because it 
was felt that agents might be more willing to at- 
tribute dissatisfaction or unusual sources of satis- 
faction to other agents than to assume responsibility 
for the statements themselves. However, both the 
mail questionnaires and depth interviews proved this 
type of projective approach to be extremely barren, 
confirming an earlier study of the effectiveness of 
direct vs. indirect questions in job satisfaction studies 
(8). This does not mean that agents do not project. 
However, if they do project, they list their own satis- 
factions and dissatisfactions and then claim that other 
agents feel the way they do. They did not attribute 
attitudes to others that they refused to attribute to 
themselves. 

The listings of satisfactions and dissatisfactions 
were edited to eliminate duplications and items that 
would apply only to a single company or agency. 
This analysis produced a pool of 104 satisfactions 
and 156 dissatisfactions. 


Building the Questionnaire 


The large number of items given by the agents, 
plus the way in which their statements were phrased, 
presented somewhat of a problem. To have tailor- 
made answer categories for each item would have 
produced too lengthy a questionnaire for mail ad- 
ministration. Uniform answer categories, such as a 
rating scale, would have done violence to the mean- 
ing of the statements in many instances; hence, check 
lists were constructed in which the satisfactions and 
dissatisfactions were treated separately. The satis- 
factions were listed, roughly by topic covered, with 
a space provided before each item so that it could 
be checked if it were a major source of satisfaction 
to the responding agent. The dissatisfaction items 
were presented in a similar fashion. 

Actually, two questionnaires were constructed, 
since even in check-list form the number of items 
was too large to be handled in a single question- 
naire. To help give the appearance that each form 
was comprehensive, and to check on equivalence of 
returns, the 16 most frequently mentioned satisfac- 
tions and the 25 most frequent dissatisfactions were 
included on both forms. The questionnaires also 
contained several biographical items and the Gen- 
eral Satisfactions Test (7) in addition to the check 
lists. 

Fifty-five new agencies were randomly selected 
and the agents in half these agencies were mailed 
one form of the questionnaire, the agents in the 
other half receiving the second form. The items 
common to both forms were analyzed, and the re- 
sponse distributions indicated that, on the common 
items, the two samples of agents were highly equiva- 
lent. 

Three hundred and fifty-six usable replies were re- 
ceived from a mailing of 697. The percentage of 
agents checking each item was computed for each of 
the 11 companies. Items were rejected that had 
been checked by less than 5 per cent. Exceptions 
were made if an item showed fairly wide company 
differences. From the original pool of items, 58 
satisfactions and 82 dissatisfactions remained for in- 
clusion on the final form. 

An inspection of the agent-supplied items showed 
very few that would correspond to what the authors 
believed should be crucial indicators of high morale 
or job satisfaction. To fill this perceived void and 
to quiet a certain sense of uneasiness, two additional 
groups of questions were constructed. One of these 
groups contained items adapted from the File-Rem- 
mers test of supervisory ability, “How Supervise?” 2 
Items were chosen on two bases: those that dealt 
directly with supervisory practices and those that 
could be reworded so as to be answered from the 
viewpoint of the agent. The agent’s task was to 
indicate whether or not a particular supervisory 
practice was characteristic of his manager. The 
second group of questions contained six items de- 
signed to get at the agents’ psychological involve- 
ment with their managers. This questionnaire also 
contained the General Satisfactions Test. 


The Main Sample 


The agents were selected for the final phase in the 
same manner as in the previous two. However, for 
the final sample approximately 20 agencies were 
chosen at random from each of the 11 companies. 
It was not possible to select a full quota of 20 agen- 


2 This test was revised by permission of the Psy- 
chological Corporation. 
cies from every company because their lists of agen- 
cies had been exhausted. Two thousand seven hun- 
dred and ten full-time agents from 202 agencies 
comprised the final population sample. 

Of the 2,710 agents, 990 returned questionnaires for 
a 37 per cent response. Although each nonrespond- 
ent received a single follow-up letter containing a 
duplicate copy of the questionnaire, no additional 
incentive was offered to increase the proportion of 
replies. The percentage of returns varied from a low 
of 22 in one company to a high of 46 for another. 

Mail questionnaires were used in this study since 
they provided the only feasible method of reaching 
a widely scattered sample of agents on a relatively 
limited budget. It is clearly recognized that the re- 
sponding agents do not constitute a random sample 
of the total group contacted. 

For certain segments of the sample it was possible 
to compare respondents with nonrespondents on sev- 
eral background characteristics. These comparisons 
indicated that older agents were less likely to re- 
spond; however, the other biographical data show 
that no major biases existed. The crucial biases are 
those involving attitudinal differences, and in that 
area we have no direct means of comparing the re- 
spondents with the nonrespondents. That 24 per 
cent of the nonrespondents terminated during the 
year following the administration of the question- 
naire as opposed to a termination rate of 11 per 
cent among the respondents, would indicate that we 
are dealing with a group that had made a some- 
what better job adjustment than the nonrespond- 
ents. Further, the respondents were more likely to 
be high producers than the nonrespondents. 


The Criterion and the Treatment of 
the Items 


The major criterion used in this study was 
survival or termination during the year sub- 
sequent to the time the agents received the 
questionnaires. A number of other perform- 
ance measures, including volume of Ordinary 
insurance sold, number of policies sold, gross 
and insurance income, etc., were also col- 
lected. However, for ease of discussion we 
are limiting ourselves to the relationship be- 
tween attitude measures and survival. 

The survival data were obtained directly 
from the companies. One year after the 
questionnaires were mailed, the companies 
were sent the listings of all their agents in- 
cluded in the original population sample. The 
companies then indicated which agents had 
died, retired, been promoted, terminated, etc. 
during the preceding year. For purposes of 
analysis, agents who retired, died, or entered 
military service have been excluded, as have 
Table 1 
Matched Sample Distribution by Company, Age, and Length of Service 

Company (11 rows, in printed order)      S      T 
                                        10     10 
                                        17     17 
                                        16     16 
                                         7      7 
                                         3      3 
                                        14     14 
                                        15     15 
                                         0      0 
                                        12     12 
                                         3      3 
                                         2      2 
Totals                                  99     99 

Age: 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60 & up (total 99; cell counts illegible in source) 

Total Years as Agent: 0-.99, 1-1.99, 2-2.99, 3-3.99, 4-4.99, 5-9.99, 10-14.99, 15-19.99, 20-24.99, 25-29.99, Unknown* (total 99; cell counts illegible in source) 


Note.—S = survivors, T = terminators. 


* When total length of service as agent was unknown, match was made on basis of experience with present agency. 


all agents who reported spending 30 per cent 
or more of their time in supervision. 

One of the major complications in the 
analysis of these kinds of data is the fairly 
high interrelationship between the criterion, 
certain background data, and the item re- 
sponses. For example, the companies in this 
study differed significantly from one another 
in termination rate, and it was known that 
the agents in different companies responded 
differently to the various attitude items. 
Across companies there was a high relation- 
ship between response to various items and 
length of service. 

In order to partially overcome these diffi- 
culties, each terminator was matched with a 
survivor on the basis of length of service, 
age, and company. The unmatched survivors 
were then discarded from this analysis. The 
extent to which the two groups are com- 
parable can be seen from the data in Table 1. 
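
The matching step described above can be sketched as follows. This is a minimal illustration and not the authors' procedure: the field names (`company`, `age_band`, `service_band`), the exact-match rule on all three bands, and the first-available choice among eligible survivors are assumptions, and the records are hypothetical.

```python
from collections import defaultdict


def match_pairs(terminators, survivors):
    """Pair each terminator with one unused survivor that has the same
    company, age band, and length-of-service band (assumed matching rule)."""
    pool = defaultdict(list)
    for s in survivors:
        pool[(s["company"], s["age_band"], s["service_band"])].append(s)

    pairs, unmatched = [], []
    for t in terminators:
        key = (t["company"], t["age_band"], t["service_band"])
        if pool[key]:
            # Take the first eligible survivor; each survivor is used once.
            pairs.append((t, pool[key].pop(0)))
        else:
            unmatched.append(t)
    return pairs, unmatched


# Hypothetical records, for illustration only.
terminators = [
    {"id": "T1", "company": "A", "age_band": "25-29", "service_band": "0-.99"},
    {"id": "T2", "company": "B", "age_band": "30-34", "service_band": "1-1.99"},
]
survivors = [
    {"id": "S1", "company": "A", "age_band": "25-29", "service_band": "0-.99"},
    {"id": "S2", "company": "A", "age_band": "25-29", "service_band": "0-.99"},
    {"id": "S3", "company": "B", "age_band": "40-44", "service_band": "1-1.99"},
]
pairs, unmatched = match_pairs(terminators, survivors)
```

Survivors left in the pool after matching (here S2 and S3) correspond to the unmatched survivors that were discarded from the analysis.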

While the matched groups make it possible 
to compare the items with the criterion, tak- 
ing full advantage of all the terminators, the 
use of matched groups has a major disadvan- 
tage. The matched sample is not representa- 
tive, nor is it purported to be representative, of 
the total population of agents in the 11 com- 
panies or even of the respondent sample. 
Hence, although it is not possible to project 
the findings to any over-all population, we 
can say that a particular response is, or is 
not, characteristic of survivors or termina- 
tors. It is not possible to estimate the degree 
to which the response and the criterion are re- 
lated in the population. 


The Results 


The major purpose of this study was to 
determine if expressions of satisfaction or 
dissatisfaction with specific job aspects are 
related to relevant performance, relevant per- 
formance being defined as survival or termi- 
nation during the year following the time the 
job satisfaction questionnaires were adminis- 
tered. 

A second purpose was to determine the re- 
lationship between the percentage of agents 
checking an item and the degree to which the 
item predicts the criterion. Of the 140 agent- 
supplied items, 21 were found to bear a sig- 
nificant relationship to the criterion, at least 
at the 5 per cent level of confidence. While 
the number of items showing significant dif- 
ferences between survivors and terminators is 
small, more items show differences than would 
be accounted for by chance. The items found 
to be significantly related to the criterion are 
shown in Table 2. 
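
The statement that 21 significant items out of 140 is more than chance would produce can be checked with a short calculation. The sketch below assumes, for illustration only, that the 140 item tests are independent, which the item intercorrelations discussed later make doubtful; it is not a computation reported by the authors.

```python
from math import comb


def binomial_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_items = 140   # agent-supplied items tested
alpha = 0.05    # significance level used for each item
observed = 21   # items found significant

# Number of items expected to reach significance by chance alone.
expected_by_chance = n_items * alpha

# Probability of 21 or more "significant" items if none were truly related
# to the criterion (independence of the item tests is an assumption).
tail = binomial_tail(n_items, alpha, observed)
```

Under these assumptions about 7 items would be expected by chance, and the tail probability for 21 or more is well below .001.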

Scores were obtained for the various sec- 
tions of the questionnaire. The scores for 
the satisfaction and dissatisfaction parts were 
simply the number of such items checked. 
The scores for “manager involvement” and 
Table 2 
Attitude Items Significantly Related to Survival in the Matched Sample 

Columns, Parts I and II: Survivors Checking (N=99); Terminators Checking (N=99); Diff./σ diff. 

Part I—Satisfaction Items 

The freedom from supervision in my job. 
Renewals as a source of deferred income. 
The helpfulness of my supervision. 
The way the home office shows its dependency on the individual agent. 
The personal friendship I have with the manager. 
The bulletins informing me of my work progress. 
Prospecting. 

Part II—Dissatisfaction Items 

The lack of special field training with the manager or supervisor. 
The public’s attitude toward life insurance agents. 
The uncertainty while getting established. 
Having my income or production figures made public. 
Having to use pressure and subterfuge in order to get a person to buy. 
Being expected to start producing on my own before completing training. 
The training I received from the agency staff. 
Not having paid vacation. 
Not having a floor to my income. 
The manager misrepresenting or failing to explain all the provisions of my contract. 
The irregularity of the hours I have to work. 
The amount of time the manager devotes to agents’ problems. 
The misrepresentation of my job and job possibilities by the manager during the hiring interview. 
Having the emphasis placed on volume, rather than quality of sales. 

Columns, Part III: Survivors Checking “Yes” (N=99); Terminators Checking “Yes” (N=99); Diff./σ diff. 

Part III—Manager Involvement 

Do you feel free to talk over any personal problems with your manager? 
Do you feel free to discuss any selling problems you might have with your manager? 
Does your manager make you feel you are doing a worth-while job? 
Is your manager the kind of person you enjoy being with socially? 

Columns, Part IV: Survivors Checking “Characteristic” of Mgr. (N=99); Terminators Checking “Characteristic” of Mgr. (N=99); Diff./σ diff. 

Part IV—Revised “How Supervise” 

Spends part of his time handling the personal problems and grievances of his agent. 
Gives each agent a detailed explanation of any changes in company policy or procedure. 

Cell values in printed order (row and column alignment lost in the source): 

56%, 63%, 62%; 70%, 37%, 70% 
94%, 99%; 90%, 92% (printed after the Part III items) 
34%, 42%, 41%; 20%, 53%, 36%, 23% 
23%, 45%, 47%; 8%, 15%, 18%, 15%, 15%, 19%, 16%; 9%, 12%; 19%; 19% 
74%, 90%; 78%, 83% (printed under Terminators Checking “Yes”) 
83%; 88% (printed after the Part IV items) 
Diff./σ diff.: 2.97, 2.82, 2.82; 2.65, 2.45, 2.13, 2.01 

the revised “How Supervise” are based on the 
number “right” minus the number “wrong,” 
plus a constant to eliminate negative scores. 
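
The part scoring described above (number checked for Parts I and II; rights minus wrongs plus a constant for “manager involvement” and the revised “How Supervise”) can be sketched as below. The item names, the scoring key, the treatment of omitted items as neither right nor wrong, and the value of the constant are all hypothetical; the article does not report them.

```python
def checklist_score(checked_items):
    """Parts I and II: the score is simply the number of items checked."""
    return len(set(checked_items))


def rights_minus_wrongs(responses, key, constant):
    """Rights minus wrongs plus a constant that keeps scores non-negative.

    responses: dict mapping item -> the agent's answer
    key:       dict mapping item -> the keyed ("right") answer
    Items the agent left blank count as neither right nor wrong (assumption).
    """
    right = sum(1 for item, ans in responses.items()
                if item in key and key[item] == ans)
    wrong = sum(1 for item, ans in responses.items()
                if item in key and key[item] != ans)
    return right - wrong + constant


# Hypothetical four-item key; the constant equals the number of items,
# so the lowest possible score (all wrong) is zero.
key = {"q1": "yes", "q2": "yes", "q3": "yes", "q4": "yes"}
responses = {"q1": "yes", "q2": "no", "q3": "yes"}  # q4 left blank
score = rights_minus_wrongs(responses, key, constant=len(key))
```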

An additional score was computed for each 
agent. Agents in the total responding sam- 
ple were randomly assigned to one of two 
groups. Within each group the dissatisfac- 
tion items were related to the criterion and 
those showing significant differences were 
cross-validated on the other group. Items 
cross-validating in both random groups were 
then scored, giving each item unit weight, 
and these scores were related to the criterion 
in the matched sample. It would have been 
more desirable to have cross-validated the 
items found to be significant in the matched 
sample analysis; however, the size of the 
matched sample precluded its division into 
an original and a holdout group. Eleven 
items were cross-validated in the total sam- 
ple, and of these, eight were also found to 
be significant when subjected to the matched 
sample controls. The remaining three items 
showed differences in the predicted direction 
in the matched sample, but the differences 
fell slightly short of significance at the 5 per 
cent level. However, all 11 items were scored 
on the matched sample. 
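
The double cross-validation just described can be sketched as follows. This is an illustration under stated assumptions, not the authors' computation: the significance test used here is a critical ratio for two independent proportions with a cutoff of 1.96, the record layout (`terminated`, `checked`) is hypothetical, and the data are synthetic.

```python
import random
from math import sqrt


def z_two_proportions(x1, n1, x2, n2):
    """Critical ratio for the difference between two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return 0.0 if se == 0 else (p1 - p2) / se


def significant_items(agents, items, z_crit=1.96):
    """Return {item: sign} for items whose check rate differs between
    terminators and survivors; sign is +1 if terminators check it more."""
    term = [a for a in agents if a["terminated"]]
    surv = [a for a in agents if not a["terminated"]]
    signs = {}
    if not term or not surv:
        return signs
    for item in items:
        x_t = sum(item in a["checked"] for a in term)
        x_s = sum(item in a["checked"] for a in surv)
        z = z_two_proportions(x_t, len(term), x_s, len(surv))
        if abs(z) >= z_crit:
            signs[item] = 1 if z > 0 else -1
    return signs


def cross_validated_items(agents, items, seed=0):
    """Split agents into two random halves and keep only the items that are
    significant, in the same direction, in both halves."""
    shuffled = agents[:]
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    first = significant_items(shuffled[:half], items)
    second = significant_items(shuffled[half:], items)
    return {i: s for i, s in first.items() if second.get(i) == s}


def unit_weight_score(agent, kept):
    """Unit-weight score: each kept item the agent checked counts +1 or -1
    according to its direction."""
    return sum(sign for item, sign in kept.items() if item in agent["checked"])


# Synthetic data: every terminator checks d1, nobody checks d2.
agents = ([{"terminated": True, "checked": {"d1"}} for _ in range(100)]
          + [{"terminated": False, "checked": set()} for _ in range(100)])
kept = cross_validated_items(agents, ["d1", "d2"], seed=0)
```

Requiring agreement in both halves guards against items that reach significance in one half by chance alone, which is the purpose the text gives for cross-validating before scoring.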

The relationship between the criterion and 
the various part scores is shown in Table 3. 
Also shown are the mean scores for the sur- 
vivors and terminators on the various parts 
of the questionnaire. It will be noted that in 
each instance the mean differences are in the 
expected direction and are all significant be- 
yond the 5 per cent level of confidence. 

A distribution of the percentage of agents 
checking each item was obtained for both 
the satisfactions and dissatisfactions. Biserial 
correlations were computed between these 


Table 3 

Relationship Between Part Scores and Survival-Termination 

Score Part I (Satisfactions)        N 
0-13                               33 
14-19                              25 
20-24                              34 
25-31                              33 
32-39                              35 
40-58                              38 

Score Part II (Dissatisfactions): 0-4, 5-7, 8-10, 11-14, 15-21, 22-28 

Per cent terminators printed beside the Part I and Part II bands (column assignment not recoverable from the source): 58%, 52%, 62%, 52%, 43%, 37% 

Score Part III (Mgr. Involvement)   N    Terminators 
Over 25                           143        43% 
22-25                              35        63% 
Less than 22                       20        80% 

Score Part IV (Rev. “How Sup.?”)    N    Terminators 
3-21                               27        70% 
22-24                              18        44% 
25-27                              36        58% 
28-29                              39        46% 
30-31                              41        46% 
32-37                              36        39% 

Special Score (“Cross-Validating”), Terminators (score bands lost in the source): 29%, 35%, 50%, 75%, 71% 

One further value, 33%, is printed without a recoverable row. 

Means by score (columns: Mean for Survivors; Mean for Terminators; Diff./σ diff.,* p<.05). Rows: Part I, Part II, Part III, Part IV, Spec. Score. Values in printed order (alignment lost in the source): 

Diff./σ diff.: 2.75, 2.54, 3.11, 2.51, 5.18 
Means: 25.06, 13.72, 24.18, 26.56 
Spec. Score: 2.48, 1.16 

* Significance of difference test based on formula for matched groups. See Edwards (2), pp. 276-277. 
distributions and whether or not the item was 
significantly related to the criterion. The bi- 
serial for the satisfaction items was -.07, 
and for the dissatisfaction items, -.01. In 
neither case is the correlation significantly 
greater than zero. In other words, no rela- 
tionship was found between the percentage 
of agents checking an item and whether or 
not that item is significantly predictive of the 
criterion. If anything, the significant items 
are checked less frequently than the nonsig- 
nificant ones. 
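
A biserial correlation of the kind reported above can be computed as in the sketch below. It uses the standard textbook formula, r_b = ((M1 - M0) / s) (pq / y), where y is the ordinate of the unit normal curve at the point dividing the proportions p and q; the article does not show its own computation, and the percentages and significance flags here are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def biserial(x, flags):
    """Biserial correlation between a continuous variable x and a dichotomy.

    x:     continuous values (here, per cent of agents checking each item)
    flags: True where the item was significantly related to the criterion
    Both groups must be non-empty.
    """
    g1 = [v for v, f in zip(x, flags) if f]
    g0 = [v for v, f in zip(x, flags) if not f]
    p = len(g1) / len(x)
    q = 1 - p
    # Ordinate of the unit normal curve at the p/q cut point.
    y = NormalDist().pdf(NormalDist().inv_cdf(p))
    return (mean(g1) - mean(g0)) / pstdev(x) * (p * q / y)


# Hypothetical data: per cent checking each of eight items, and whether
# each item was significantly related to the criterion.
pct_checking = [62, 48, 35, 30, 12, 9, 20, 41]
significant = [False, False, False, False, True, True, True, False]
r_b = biserial(pct_checking, significant)
```

A negative value, as in this made-up example, means the significant items are the less frequently checked ones. Unlike a product-moment correlation, the biserial coefficient is not bounded by 1 in small or non-normal samples.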


Discussion of the Findings 


That the satisfaction or dissatisfaction 
agents feel toward certain aspects of their 
jobs is related to the criterion of survival 
would seem fairly well established. It would 
also seem established that the predictive items 
are by no means the ones checked most fre- 
quently as being sources of satisfaction or 
dissatisfaction. These results raise some in- 
teresting problems for the interpretation and 
methodology of job satisfaction surveys. 

In normal use the job satisfaction survey 
provides information about employees’ atti- 
tudes toward specific job features. The atti- 
tudes are usually treated as independent of 
one another and are thought of as indicating 
specific areas that may need remedial atten- 
tion. For example, if the present study had 
been completely anonymous, the only infor- 
mation available to the companies would 
have been the proportion of their agents who 
were satisfied and dissatisfied with the par- 
ticular items. Hence, taking account simply 
of the frequency of response, the job aspects 
accounting for the most dissatisfaction among 
life insurance agents would be: (a) the licens- 
ing of part-time agents, (b) people insuring 
their houses and cars but not their lives, (c) 
general insurance agents writing life insur- 
ance, and (d) the commission schedule; yet 
none of these items is related to the survival 
criterion. However, from the information 
provided by the relationships between atti- 
tude and performance, it would appear that 
the keys to a specific problem such as turn- 
over lie in the area of (a) training, (b) 
agency level supervision, and (c) the inse- 
curity felt by new agents. Since many of 
these items were checked by a small propor- 
tion of the agents, it is conceivable that in 
the usual application of satisfaction surveys, 
these job attitudes would be overlooked or 
attributed to “chronic gripers.” 

It could be argued that the predictive re- 
sponses were given by agents who were on 
the verge of quitting and are more or less ex- 
pressions of a “sour grapes” attitude. This 
latter argument would be stronger if the ter- 
minators indicated more dissatisfaction (or 
less satisfaction) with most aspects of their 
jobs. However, the predictive items defi- 
nitely tended to center around the immedi- 
ate supervisor and training. In general, the 
terminators liked their company, the “insti- 
tution of life insurance,” and their job duties 
just about as well as the survivors. The 
clustering of the predictive items into certain 
categories and not into others suggests that 
the decision to quit is at least not portrayed 
by general dissatisfaction with the job. 

To a certain extent the evidence that in- 
creasing dissatisfaction (or decreasing satis- 
faction) is linearly related to termination 
without differential item weights suggests that 
the agents’ reactions to specific items are not 
crucial indicators of causes of termination. 
An analysis of the item intercorrelations indi- 
cated that there were several clusterings of 
highly interrelated items.3 That is, a large 
part of the response variance was probably 
attributable to the effect of a few underlying 
factors. At the present time a factor analy- 
sis of the agent-supplied items is being car- 
ried out. If meaningful factors are isolated, 
it should be possible to determine with some- 
what more precision which job aspects are 
most related to termination. However, if in- 
telligible factors are found, as the data to 
date suggest they will be, the implications for 
the typical anonymous satisfaction survey 
should be clear. That is, it would remain a 
useful instrument of communication between 
employee and employer, but its fruitfulness 
will be dependent upon further investigations 
which isolate those job aspects that relate to 
a criterion. 


3 Since a 140 × 140 matrix of intercorrelations is 
rather sizable, it has not been presented in this 
article. 
The results of the study underline the im- 
portance of using behavioral criteria in the 
analysis of job satisfaction surveys, but this 
presupposes that it is possible to identify the 
individual respondents. Arguments could be 
raised that the respondents will answer dif- 
ferently when they know their responses can 
be identified than when they are assured 
anonymity. The only way to accept or re- 
ject this kind of argument is to gather ex- 
perimental evidence, comparing the responses 
obtained from signed vs. anonymous ques- 
tionnaires,4 and also comparing the kinds of 
interpretation obtained from the use of indi- 
vidual vs. rough group criteria. In the mean- 
time, while signed questionnaires may pos- 
sibly produce somewhat biased results, we 
feel that the “validation” of job satisfaction 
responses is more productive of leads to the 
solution of morale problems and is less sub- 
ject to interpretative error than a census of 
employee attitudes using an anonymous ques- 
tionnaire without validation. This position, 
we feel, is strengthened by the finding of such 
a low relationship between predictiveness of 
an item and the proportion of respondents en- 
dorsing the item. 


Summary 


A survey of job satisfaction among Ordi- 
nary Life insurance agents indicated that cer- 
tain attitudes held by agents are significantly 
related to the criterion of survival-termina- 


4 This study is now being made. 
tion. It was also found that the proportion 
of agents expressing dissatisfaction with a 
particular item was not related to whether or 
not that item was predictive of the criterion. 
The data show that the validation of signed 
job satisfaction questionnaires leads to a 
much different kind of interpretation of the 
responses than is obtained from anonymous 
questionnaires. 


Received March 28, 1955. 
Early Publication. 
Book Reviews 


Saunders, Lyle. Cultural differences and 
medical care. New York: Russell Sage 
Foundation, 1954. Pp. 317. $4.50. 

This interesting volume, based on a socio- 
logic study of the cultural pattern of the 
Spanish-speaking populations of the South- 
west, attempts to analyze the barriers that 
limit their acceptance of modern health serv- 
ices and medical care. The author, a pro- 
fessor of sociology in a medical school, uses 
his findings on this group as a basis for gen- 
eral observations and conclusions as to the 
importance of a knowledge of cultural back- 
ground in the development of patient-physi- 
cian relationships and the planning of a com- 
munity health program. 

The first half of the book is an interesting 
though somewhat repetitious description of 
the various groups that make up the Span- 
ish-speaking segment of the American South- 
west. Their general characteristics, degree of 
acculturation, and their attitudes toward 
“Anglo” customs and institutions are sympa- 
thetically described. 

A short and, to this reviewer, somewhat in- 
adequate description of the medical customs 
and beliefs introduces a discussion of meth- 
ods that have been and should be tried to 
bridge the gap that exists between the Span- 
ish and the Anglo cultures. Two unsuccess- 
ful experiments in medical care programs are 
described very briefly, with quite extensive 
conclusions as to reasons underlying their 
failure. Unfortunately many of the elements 
are only too reminiscent of the failures of 
similar programs among people of less di- 
vergent cultural patterns. 

Professor Saunders’ book is well written and 
interesting but may well fail to reach many 
of the readers in the medical profession for 
whom it might be most valuable. A similar 
volume, reduced to about half this length and 
with more attention to the medical compo- 
nent of the Spanish-American culture, might 
serve a better purpose among physicians, who, 
even more than the sociologist or anthropolo- 
gist, are in need of a volume of this character. 


Gaylord W. Anderson 


University of Minnesota 


Meehl, Paul E. Clinical versus statistical 
prediction: a theoretical analysis and re- 
view of the evidence. Minneapolis: Uni- 
ver. of Minnesota Press, 1954. Pp. x + 
149. $3.00. 


Since about 1940, there has been a continu- 
ing debate among psychologists regarding the 
relative accuracy and efficiency of statistical 
(actuarial) predictions and those made by 
clinicians on the basis of subjective “under- 
standing” of individual cases. Such parts of 
this debate as have appeared in print have 
been typically either pro or con and often 
characterized by a liberal use of evaluative 
adjectives. 

In spite of some 20 published researches 
bearing on the issues, the debate, both pub- 
lic and private, continues to be as lively as 
ever. And, whereas a few years ago only a 
handful of psychologists had taken a strong 
position, today—as the result of the postwar 
boom in clinical psychology—the protagonists 
number in the thousands. 

Although Professor Meehl has been known 
by his friends to have had a long-time inter- 
est in the problem, this book represents the 
author’s first published statement of his po- 
sition. As a “hybrid working clinician and 
rat psychologist” he believes himself in a fa- 
vorable position to treat the issues with rea- 
sonable objectivity. In this reviewer’s opin- 
ion, he has succeeded admirably. 

The result is a series of well-written essays, 
almost any one of which can be read and en- 
joyed in its own right: e.g., “The Rationality 
of Inference from Class Membership” and 
“The Special Powers of the Clinician.” More 
important, however, is the fact that the series 
constitutes an incisive logical analysis of the 
theoretical issues, a detailed review of the 
relevant evidence and a frank facing up to 
the resulting implications for clinical practice. 

In his effort to provide an objective analy- 
sis of the problem, the author has not re- 
treated to a middle of the road position, i.e., 
“everyone is right.” Instead he seeks to ana- 
lyze the over-all problem into a series of more 
clear-cut separate issues, with respect to each 
of which he takes a firm position. The re- 
sult is that each reader, depending on his own 
bias, is likely to regard at least parts of the 
book as less than objective! 

Since available evidence fails to provide a 
rational basis for the alleged superiority of 
the clinical method of prediction, the author 
undertakes an analysis of the possible reasons 
for the widely held conviction that it is su- 
perior. In doing so, he emphasizes an impor- 
tant distinction between the tentative hy- 
potheses regarding specific behavior that the 
clinician makes during a therapy hour and 
predictions with respect to future perform- 
ance of an individual on some criterion. In 
the therapy situation, the clinician’s confi- 
dence in his predictive ability is reinforced if 
only an occasional hypothesis is verified. 
Furthermore, in this situation, the fact that 
many other hypotheses are not confirmed in 
no way detracts from the value of those that 
work. By contrast, in the prediction situa- 
tions, all poor hypotheses (incorrect beta 
weights) contribute to error variance and 
thus counteract the value of the good hy- 
potheses (correct beta weights). 

My own criticisms of this book are minor. 
The author admits that most of the manu- 
script was written by 1950 and subsequently 
modified only to incorporate some of the 
newer empirical evidence. Because the job is 
so generally well done, I wish it were more 
complete. For example, there is no reference 
to the OSS book which takes one of the 
strongest pro-clinical positions in print, to 
Eysenck’s equally strong arguments (and evi- 
dence) on the other side, or to the generally 
negative findings of the Menninger Founda- 
tion’s project which relied heavily on clinical 
predictions. Interesting, too, is the fact that 
the widely used terms “idiographic” and 
“nomothetic” are not introduced into the dis- 
cussion. 

In spite of these and other omissions, the 
book is a little gem. If it is read as widely 
as it deserves to be read, it should serve to 
offset the loose thinking which characterizes 
all too many discussions of this important 
problem. 


E. Lowell Kelly 
University of Michigan 




Planty, Earl G., and Freeston, J. Thomas. De- 
veloping management ability. New York: 
Ronald Press, 1954. Pp. 447. $7.00. 

Six hundred questions varying in scope 
from “What is an information rack?” to 
“How can we get presidents and vice presi- 
dents to be willing to learn new things?,” 
with answers varying in length from a single 
sentence to several pages make up the book. 

There are 30 chapters which are grouped 
under an introduction and the four main 
headings of methods of development, types of 
development, organizing and operating de- 
velopment, and evaluation. There is a 104- 
item bibliography and a grouping of this 
bibliography under chapter headings. 

Practically every phase of determining the 
need for, planning, “selling,” and administer- 
ing development programs has been covered 
to some degree. Both group and individual 
methods of development are covered, includ- 
ing most of the newest techniques which have 
come into prominence only in recent years. 
Despite the title, there is much that would be 
of value to trainers working with people at 
nonmanagement levels or in nonbusiness 
situations as well. 

This is not a theoretical book, nor is any 
part of it devoted to the research upon which 
many management development methods have 
been built. It is, instead, a bringing together 
of much of the current thinking in the field of 
management development in a condensed form 
and stated, as the authors point out, in a 
factual and positive tone without claims of 
finality for methods or viewpoints. This will 
disturb some readers, who will quickly sense 
that some questions are given a single answer 
and with no mention made of what might be 
equally good alternatives. However, because 
the authors have drawn on many other leaders 
in the field of management development, the 
statements made often reflect the best think- 
ing on a particular problem in that field to- 
day. Moreover, many of the questions are 
noncontroversial in nature and call for direct 
and factual answers. 

To the person who is looking for answers to 
many of the questions that arise concerning 
management development without getting into 
the theoretical implications or research back- 
ground, this book can be recommended highly. 


Theodore R. Lindbom 
Midland Cooperatives, Inc., 
Minneapolis, Minnesota 


Hersey, Rexford. Zest for work. New York: 
Harpers, 1955. 

The author has been at it for some 27 years. 
He has an earlier book on Workers’ Emotions, 
1932, and a somewhat similar publication in 
Germany. The present book includes con- 
siderable material from these previous studies 
and some additions. Much of it involves 
workers in railroad shops. A major tech- 
nique was daily interviews with individual 
workers. They told their experiences since 
the last interview and to some extent intro- 
spected as to their emotional level. A 12- 
point scale of emotions was frequently used. 
There was also personal observation and case- 
history material, and information obtained 
from supervisors. The typical presentation 
of this material is to indicate some possible 
relationship such as accidents and emotional 
conditions and then give detailed case studies
of workers illustrating the relationship. These
case studies involve numerous quotations from
the workers. A case is summarized frequently
with parallel columns for positive factors and
negative factors and blocks for (a) work factors,
(b) outside factors, (c) personality factors,
(d) influence of the past. There are
tables giving, for example, production figures 
or accidents for emotionally high and emotionally
low periods. Hersey makes a lot out
of these cyclical highs and lows, as in his 
earlier volume. Few other writers have 
caught these—perhaps because they did not 
do such exhaustive interview studies. 

The discussion corroborates the usual find- 
ings with reference to such things as security, 
congenial job, working conditions, good su- 
pervisor. It stresses the individual approach 
to employee relations, which would be ex- 
pected from a person with clinical orienta- 
tion. This point is convincing. 

Some readers may be interested in the case 
studies of a couple of shop stewards including 
detailed quotations. The author is optimistic 
about the role of the steward in industrial 
relations and suggests that the foreman will 
find it profitable to discuss many things with 
the steward in the informal stage before they 
require any formal action. He found his 
stewards sincerely interested in helping the 
workers and likewise willing to talk things 
over with the supervisor. 

While the work does not turn up any dra- 
matic discoveries, the case studies should in- 
terest some readers. The general reader may 
find that they “drag” a little, but the spe- 
cialist may appreciate the detailed protocols. 
It is quite appropriate for Hersey to pull to- 
gether in one book this rather extensive ac- 
cumulation of case material. 


Harold E. Burtt 
The Ohio State University 